Nonlinear dynamical systems store and generate information: they intrinsically compute.
Real computing devices use nonlinearity to do 
the same, except that they are designed to com- 
pute — the information serves some utility or func- 
tion determined by the designer. Intuitively, use- 
ful computing devices must be constructed out of 
(physical, chemical, or biological) processes that 
have some minimum amount of intrinsic compu- 
tational capability. However, the exact relation- 
ship between intrinsic and designed computation 
remains elusive. In fact, bridging intrinsic and de- 
signed computation requires solving a number of 
intermediate problems. One is to understand the 
diversity of intrinsic computation of which nonlin- 
ear dynamical systems are capable. Another is to 
determine if one can practically manipulate these 
systems in the service of functional information 
generation and storage. 

Here, we address both of these problems from 
the perspective of information theory. We de- 
scribe new information processing characteristics 
of dynamical systems and the stochastic processes 
they generate. We focus particularly on two key 
aspects that impact design: synchronization and 
control. Synchronization concerns how we come 
to know the hidden states of a process through 
observations; while control, how we can manipu- 
late a process into a desired internal condition. 



I. INTRODUCTION 

Given a model of a stationary stochastic process, how 
much information must one extract from observations to 
exactly know which state the process is in? With this, an 
observer is said to be synchronized to the process. (For 
an introduction to the problem, see Ref. [1].)

Given that one has designed a stochastic process, is 
there a series of inputs that reliably drive it to a desired 
internal condition? If so, the designed process is said to 
be controllable. 

Synchronization and control are dual to each other: In 
synchronization, an observer attempts to predict the pro- 
cess's internal state from incomplete and indirect obser- 
vations, typically starting with complete ignorance and 
hopefully ending with complete certainty. In control, one 
must extract from the design a series of manipulations, 
typically indirect, that will drive the process to a de- 
sired state or set of states. The duality is simply that 
the observer's measurements can be interpreted as the 
designer's control inputs. 

Synchronization and control are key aspects of intrinsic and designed computation, both for detecting intrinsic computation in dynamical systems and for leveraging a dynamical system's intrinsic computation into useful computation. For the latter, the circuit designer attempts to build circuits, themselves dynamical systems, that synchronize to incoming signals.

For example, even the most mundane initial opera- 
tion is essential: When power is first applied, a digital 
computer must predictably reach a stable and repeat- 
able state, without necessarily being able to perform even 
small amounts of digital intelligent control or analysis 
of its changing environment. Without reliably reaching 
a stable condition — now a quite elaborate operation in 
modern microprocessors — no useful information process- 
ing can be initiated. The device is still a dynamical sys- 
tem, of course, but it fails at raising itself from that pro- 
saic condition to the level of a computing device. 

Once digital computing operations have commenced, 
similar concerns arise in the timing and control of infor- 
mation being loaded from memory into a register. Not 
only must each data bus line synchronize properly or risk 
misconstruing the voltage level offered up by the wires, 
but this must happen simultaneously across a number of 
component devices — quite wide buses, 128 and 256 lines 
are not uncommon today. 

Stepping back a bit, one must wonder what tools dy- 
namical systems theory itself provides to analyze and de- 
sign computation. Indeed, many of the properties of- 
ten used to characterize and classify dynamical systems 
are time-asymptotic — the Kolmogorov-Sinai entropy or 
Shannon entropy rate, the spectrum of Lyapunov char- 
acteristic exponents, the fractal and information dimen- 
sions (which rely on the asymptotic invariant measure), 
come to mind. However, real computing is not asymp- 
totic. Individual logic gates, as dynamical systems, de- 
liver their results on the short term. Indeed, the faster 
they do this, the better. 

How can we bridge the gap between dynamical systems theory and the need to characterize the short-term properties of dynamical systems? A suggestive example is found in the analysis of escape rates [2], a property of transient, short-term behavior. Another answer is found in synchronization and controllability, as they too are properties of the short-term behavior of dynamical systems. We will show that there is a connection between these properties and the more typical asymptotic view of dynamical behavior: Synchronization and control are determined by the nature of convergence to the asymptotic. They are our subject.

Given the duality between synchronization and con- 
trol, in the following we present results in terms of only 
one notion — synchronization. The results apply equally 
well to control, though with different interpretations. 



A. Precis 

Analyzing informational convergence properties is the 
main strategy we will use. However, as we will see, differ- 
ent properties converge differently from each other, either 
for a given process or as one looks across a family of pro- 
cesses. Moreover, for a given process we will consider a 
family of different representations of it. The result, while 
giving insight into informational properties and how rep- 
resentations can distort them, ends up being a rather 
elaborate classification scheme. To reduce the apparent 
complication, it will be helpful to give a detailed sum- 
mary of the steps we employ in the development. 

After describing related work, we review the use of 
Shannon block entropy and related quantities, analyz- 
ing their asymptotic behavior and aspects of conver- 
gence. We introduce a single framework — the conver- 
gence hierarchy — to call out the systematic nature of con- 
vergence properties. 

We then take a short detour to introduce the range of possible descriptions a process can have, noting their defining properties. One, the ε-machine, plays a particularly central role, as it allows one to calculate all of a process's intrinsic properties. Other descriptions typically do not allow this to the same broad extent.

With a model in hand, one can start to discuss how one synchronizes to its states. When the model is the ε-machine, one can speak of synchronizing to the process itself. To do this, we analyze the convergence properties of two new entropies: the state-block entropy and the block-state entropy. We establish their general asymptotic properties, introducing convergence hierarchies of their own, paralleling that for the block entropy. For finitary processes, the block entropy converges from below, but the new block-state entropy converges from above to the same asymptote. One benefit is that estimation methods can be improved through use of bounds from above and below.

When we specialize to the ε-machine, we establish a direct connection between synchronization and how the block entropies converge. We provide an informational measure, the synchronization information, that summarizes the total uncertainty encountered during synchronization. We relate this back to the transient information introduced previously, which derives only from the observed sequences, not requiring a model or a notion of state. Along the way, we discuss a process's Markov order (the scale at which "asymptotic" statistics set in) and its cryptic order (the length scale over which internal state information is spread). These scales control synchronization.

The development then, step-by-step, relaxes the ε-machine's defining properties in order to explore an increasingly wide range of models. A particular emphasis in this is to show how nonoptimal models bias estimates of a process's informational properties. Conversely, we learn how certain classes of models, some widely used in mathematical statistics and elsewhere, make strong assumptions and, in some cases, preclude the estimation of important process properties.

Starting with the class of minimal, optimally predictive models that synchronize (finitary ε-machines), we first relax the minimality assumption. We show that needless model elaborations, such as more, but redundant, states, can affect synchronization. We identify that class which still does synchronize. Then, we consider nonminimal unifilar, nonsynchronizing models. Finally, we relax the unifilarity assumption. At each stage, we see how the convergence properties of the various entropies change. These changes, in turn, induce a number of informational measures of what the models themselves contribute to a process's now largely apparent information processing.

A key tool in the analysis takes advantage of the fact that the various multivariable information quantities form a signed measure [3]. Their visual display, a form of Venn diagram called an information diagram, brings some order to the notation and classification chaos.



B. Synchronization and Control: Related Work 

Controlling dynamical systems and stochastic processes has an extensive history. For linear dynamical systems see, for example, Ref. [4] and for hidden Markov models see, for example, Ref. [5]. More recently, there has been much work on controlling nonlinear dynamical systems, a markedly more difficult problem in its full generality; see Refs. [6-8].

Synchronization, too, has been very broadly studied and for much longer, going back at least to Huygens [9]. It is also an important property of symbolic dynamical systems [10]. It has even become quite popularized of late, being elevated to a general principle of natural organization [11].

Here, we consider a form of synchronization that 
is, at least at this point, distinct from the dynamical 
kind. Moreover, we take a complementary, but dis- 
tinct approach — that of information theory — to address 
control and synchronization. This was introduced in 
Ref. [12] and several applications are given in Refs. [TlfT^. 
A roughly similar problem setting for synchronization is 
found in Ref. [H]. We note that the closely related top- 
ics of state estimation and control are addressed in infor- 
mation theory [iSl [H] , nonlinear dynamics [TTHTO] , and 
Markov decision processes [20] . 

Adapting the present approach to continuous dynamical systems and stochastic processes remains a future effort. For the present, the closest connections will be found to the work cited above on hidden Markov models and symbolic dynamical systems.

II. BLOCK ENTROPY AND ITS
CONVERGENCE HIERARCHY

It is an interesting fact, perhaps now intuitive, that to estimate even the randomness of an information source, one must also estimate its internal structure. Ref. [12] gives a review of this interdependence and it serves as a starting point for our analysis of synchronization, which is a question about coming to know the source's states from observations. Indeed, if one has to make estimates of internal organization just to get to randomness, then one, in effect and without too much more effort, can also address issues of synchronization. There is an intimate relationship that we hope to establish.

We briefly review Ref. [12], largely to introduce notation and highlight the main ideas needed for synchronization. This review and our development of synchronization require the reader to be facile with information theory at the level of the first half of Ref. [21], signed information measures and information diagrams of Ref. [3], and their uses in Refs.

A. Stationary Stochastic Processes

The approach in Ref. [12] starts simply: Any stationary process, $\mathcal{P}$, is a joint probability distribution $\Pr(\overleftarrow{X}, \overrightarrow{X})$ over past and future observations. This distribution can be thought of as a communication channel with a specified input distribution, $\Pr(\overleftarrow{X})$. It transmits information from the past $\overleftarrow{X} = \ldots X_{-3} X_{-2} X_{-1}$ to the future $\overrightarrow{X} = X_0 X_1 X_2 \ldots$ by storing it in the present. $X_t$ is the random variable for the measurement outcome at time $t$; the lowercase $x_t$ denotes a particular value. Throughout this work, we always use $\overleftarrow{X}$ and $\overrightarrow{X}$ in the limiting sense. That is, we work with length-$L$ sequences or blocks of random variables, $X_t^L = X_t X_{t+1} \cdots X_{t+L-1}$, and take the limit as $L$ approaches infinity.

In the following, we consider only discrete measurement outcomes, $x \in \mathcal{A} = \{1, 2, \ldots, k\}$, and stationary processes, $\Pr(X_t^L) = \Pr(X_0^L)$, for all times $t$ and block lengths $L$. Unlike some definitions of stationarity, this makes no assumptions about the process's internal starting conditions, as such knowledge obviates the very question of synchronization.

Such processes include those found in the field of stochastic processes, of course, but one also has in mind the symbolic dynamics of continuous-state continuous-time or continuous-state discrete-time dynamical systems on their invariant sets. The notions also apply equally well to one-dimensional spatial configurations of spin systems and of deterministic and probabilistic cellular automata, where one interprets the spatial coordinate as a "time".

B. Block Entropy

One measure of the diversity of length-$L$ sequences generated by a process is its Shannon block entropy:

$$ H(L) \equiv H[X_0^L] \tag{1} $$
$$ \phantom{H(L)} = - \sum_{w \in \mathcal{A}^L} \Pr(w) \log_2 \Pr(w) , \tag{2} $$

where $w = x_0 x_1 \ldots x_{L-1}$ is a word in the set $\mathcal{A}^L$ of length-$L$ sequences. It has units of [bits] of information. One can think of the block entropy as a kind of transform that reduces a process's distribution over the (typically infinite) number of sequences to a function of a single variable $L$. In this view, Ref. [12] focused on a simple question: What properties of a process can be determined solely from its $H(L)$?
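To make Eqs. (1) and (2) concrete, the following is a minimal sketch that evaluates $H(L)$ by brute-force enumeration of words. The example process (the Golden Mean Process, which forbids two consecutive 0s), the dictionary layout `T[x][i][j]`, and the function names are assumptions introduced here for illustration; they do not appear in the text.

```python
from itertools import product
from math import log2

# Hypothetical example process (not from the text): the Golden Mean Process,
# written as a Mealy (edge-output) hidden Markov model with states A=0, B=1.
# T[x][i][j] = Pr(emit symbol x and move to state j | current state i).
T = {
    0: [[0.0, 0.5], [0.0, 0.0]],
    1: [[0.5, 0.0], [1.0, 0.0]],
}
PI = [2 / 3, 1 / 3]  # stationary state distribution of this example


def word_probability(word, T=T, pi=PI):
    """Pr(w): push the stationary state distribution through the labeled matrices."""
    dist = list(pi)
    n = len(dist)
    for x in word:
        dist = [sum(dist[i] * T[x][i][j] for i in range(n)) for j in range(n)]
    return sum(dist)


def block_entropy(L, T=T, pi=PI):
    """Shannon block entropy H(L) in bits, summing over all |A|^L words."""
    H = 0.0
    for word in product(sorted(T), repeat=L):
        p = word_probability(word, T, pi)
        if p > 0.0:  # forbidden words contribute nothing
            H -= p * log2(p)
    return H
```

The enumeration cost grows as $|\mathcal{A}|^L$, so this is only practical for short blocks; it is meant to mirror the definition, not to be an efficient estimator.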

C. Source Entropy Rate 

One of those properties, and historically the most widely used and technologically important, is Shannon's source entropy rate:

$$ h_\mu \equiv \lim_{L \to \infty} \frac{H(L)}{L} . \tag{3} $$

The entropy rate is the irreducible unpredictability of a process's output: the intrinsic randomness left after one has extracted all of the correlational information from past observations. The difference between the logarithm of the alphabet size and the entropy rate, $\log_2 |\mathcal{A}| - h_\mu$, indicates how much the raw measurements can be compressed. More precisely, Shannon's First Theorem states that the output sequences $x^L$ from an information source can be compressed, without error, to $L h_\mu$ bits [21]. Moreover, Shannon's Second Theorem gives operational meaning to the entropy rate [21]: A communication channel's capacity must be larger than $h_\mu$ for error-free transmission.

D. Excess Entropy 

As noted, any process (chaotic dynamical system, spin chain, cellular automaton, to mention a few) can be considered a channel that communicates its past to its future. The messages to be transmitted in this way are the pasts which the process can generate. Thus, the "capacity" of this channel is not something that one optimizes, as is done in Shannon's theory to engineer channels and construct error-free encodings. Rather, we think of it as how much of the process's channel is actually used.

A process's channel utilization is another property that can be determined from the block entropy. It is called the excess entropy and is defined, closely following Shannon's channel capacity definition, by:

$$ \mathbf{E} \equiv I[\overleftarrow{X}; \overrightarrow{X}] , \tag{4} $$

where $I[Y; Z]$ is the mutual information between random variables $Y$ and $Z$. It has units of [bits] and tells one how much information the output shares with the input and so measures how much information is transmitted through a possibly noisy channel.
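One way to get a feel for Eq. (4) is to truncate the past and future to adjacent length-$L$ blocks. By stationarity, the mutual information between the two blocks is $2H(L) - H(2L)$, which approaches the excess entropy as $L$ grows for finitary processes. The sketch below assumes the block entropy is available as a callable; the two toy processes and all function names are illustrative assumptions, not taken from the text.

```python
def block_mutual_information(H, L):
    """I[X_{-L}^L ; X_0^L] = 2 H(L) - H(2L): a finite-L stand-in for E.

    H is any callable returning the block entropy H(L) in bits.
    """
    return 2.0 * H(L) - H(2 * L)


def fair_coin_H(L):
    # IID fair coin: all 2^L words are equally likely, so H(L) = L bits.
    return float(L)


def period_two_H(L):
    # Period-2 process ...0101...: only the phase is uncertain once L >= 1.
    return 0.0 if L == 0 else 1.0
```

For the fair coin the estimate is 0 bits at every $L$ (nothing is communicated from past to future), while for the period-2 process it is 1 bit (the phase).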



E. Block Entropy Asymptotics 

It has been known for quite some time now that the entropy rate and excess entropy control the asymptotic behavior ($L \to \infty$) of a finitary process's block entropy. Specifically, it scales according to the linear asymptote:

$$ H(L) \propto \mathbf{E} + h_\mu L . \tag{5} $$

That is,

$$ \mathbf{E} = \lim_{L \to \infty} \left( H(L) - L h_\mu \right) . \tag{6} $$

$\mathbf{E}$ is the sublinear part of $H(L)$. This gives important general insight into the block entropy's behavior. It is also quite practical, though: If $H(L)$ actually meets the asymptote at some finite sequence length $R$, then the process is effectively an order-$R$ Markov chain [12, 21]: $\Pr(X_0 | \overleftarrow{X}) = \Pr(X_0 | X_{-R}^R)$. Interestingly, many finitary processes do not reach the asymptote at finite lengths and so cannot be recast as Markov chains of any order. Roughly speaking, they have various kinds of infinite-range correlation.
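The Markov-order criterion just described can be sketched as a small check on a table of block entropies. This is only an illustration under stated assumptions: $h_\mu$ and $\mathbf{E}$ are taken as known exactly, the table is finite (so it can only suggest an order, never prove one), and the function name and tolerance are hypothetical.

```python
def meets_asymptote_at(H, h_mu, E, tol=1e-9):
    """Smallest R such that H[L] equals E + h_mu * L for every tabulated L >= R.

    H is a list with H[L] the block entropy at length L (so H[0] == 0).
    Returns None if even the last tabulated value is off the asymptote.
    """
    R = None
    # Scan from the longest block downward until the asymptote is first missed.
    for L in range(len(H) - 1, -1, -1):
        if abs(H[L] - (E + h_mu * L)) > tol:
            break
        R = L
    return R
```

For example, a period-2 process has $H(L) = 1$ for $L \geq 1$, with $h_\mu = 0$ and $\mathbf{E} = 1$, and the check reports order 1; a fair coin, with $H(L) = L$, reports order 0.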



F. The Convergence Hierarchy

In this way, the study of how the block entropy converges, or does not, is a tool for classifying processes. Reference [12] showed that the entropy rate and excess entropy are merely two players in an infinite hierarchy that determines the shape of $H(L)$. The central idea is to take $L$-derivatives and integrals of $H(L)$.

To start, one has the block entropy difference:

$$ \Delta H[X_0^L] \equiv H[X_0^L] - H[X_0^{L-1}] , \tag{7} $$

where $\Delta$ is the discrete derivative with respect to block length $L$. It is easy to see that the right-hand side is the conditional entropy $H[X_{L-1} | X_0^{L-1}]$ and that, in turn,

$$ h_\mu = \lim_{L \to \infty} H[X_{L-1} | X_0^{L-1}] \tag{8} $$
$$ \phantom{h_\mu} = H[X_0 | \overleftarrow{X}] , \tag{9} $$

recovering the entropy rate. It is often useful to directly refer to the length-$L$ approximation to the entropy rate as $h_\mu(L) = H[X_{L-1} | X_0^{L-1}]$. Since $h_\mu(L) \geq h_\mu$, it converges from above.

The excess entropy, for its part, controls the convergence speed, as it is the discrete integral:

$$ \mathbf{E} = \sum_{L=1}^{\infty} \left( h_\mu(L) - h_\mu \right) . \tag{10} $$

It requires only a few steps to see that this form is equivalent to the earlier definition of $\mathbf{E}$.

Following a similar strategy, the discrete integral

$$ \mathbf{T} \equiv \sum_{L=0}^{\infty} \left[ \mathbf{E} + h_\mu L - H(L) \right] \tag{11} $$

measures how $H(L)$ itself reaches its linear asymptote $\mathbf{E} + h_\mu L$. $\mathbf{T}$ is called the transient information and it is implicated in determining the Markov order and, as we will show, synchronization.

The pattern should be clear now: At the lowest level, the transient information indicates how quickly the block entropy reaches its asymptote. Then, that asymptote grows at the rate $h_\mu$ and has y-intercept $\mathbf{E}$. It might be helpful to refer to the graphical summary of block-entropy convergence and the associated information measures given in Ref. [12, Fig. 2]. Analogous diagrams will appear shortly.

All this can be compactly summarized by introducing two operators: a derivative and an integral that operate on $H(L)$. The derivative operator at the $n$th level is:

$$ \Delta^n H[X_0^L] \equiv \Delta^{n-1} H[X_0^L] - \Delta^{n-1} H[X_0^{L-1}] , \tag{12} $$

for $L \geq n = 1, 2, \ldots$ and, for $L \geq n = 0$,

$$ \Delta^0 H[X_0^L] \equiv H[X_0^L] . \tag{13} $$

The integral operator is:

$$ \mathcal{I}_n \equiv \sum_{L=n}^{\infty} \left( \Delta^n H[X_0^L] - \lim_{\ell \to \infty} \Delta^n H[X_0^\ell] \right) , \tag{14} $$

$n = 0, 1, 2, \ldots$. (This is a slight deviation from Ref. [12] when $n = 2$. See App. A.)



To make the connection with what we just discussed, in this notation we have:

$$ h_\mu = \lim_{L \to \infty} \Delta^1 H[X_0^L] , \tag{15} $$
$$ \mathbf{E} = \mathcal{I}_1 , \text{ and} \tag{16} $$
$$ \mathbf{T} = -\mathcal{I}_0 . \tag{17} $$

Additionally, $\mathcal{I}_2$ is a process's total predictability $\mathbf{G}$ and $\Delta^2 H[X_0^L]$ is its predictability gain: the rate at which predictions improve by going to longer sequences.

The two operators, $\Delta^n$ and $\mathcal{I}_n$, define the entropy convergence hierarchy for a process, capturing those properties reflected in the process's block entropy. Given a process's specification, one attempts to calculate the hierarchy analytically; given data, to estimate it empirically. In addition to systematizing a process's informational properties, the hierarchy has a number of uses. For example, structural classes of processes can be distinguished by the $n^*$ at which the hierarchy becomes trivial; for example, when $\Delta^n H[X_0^L] = 0$, $n > n^*$. Other classifications turn on bounded $\mathcal{I}_n$. The finitary processes, for example, are defined by $n^* = 1$: $\mathcal{I}_1 = \mathbf{E} < \infty$. Or, conversely, there are well known processes for which some integrals diverge; they include the onset of chaos through period-doubling, where the excess entropy diverges. Reference [12, Sec. VII.A] introduces a classification of processes along these lines.
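The first levels of the hierarchy can be sketched numerically from a finite table of block entropies. The code below is an illustration only: it assumes the table is long enough that $H(L)$ has reached its linear asymptote, so that the last difference can stand in for $h_\mu$, and it truncates the sums in Eqs. (10) and (11) at the end of the table. The function names are hypothetical.

```python
def discrete_derivative(values):
    """One application of Delta: out[k] = values[k + 1] - values[k]."""
    return [b - a for a, b in zip(values, values[1:])]


def hierarchy_estimates(H):
    """Truncated estimates of (h_mu, E, T) from block entropies H[0..Lmax].

    Assumption: H[L] has already reached its linear asymptote by Lmax,
    so the final difference H[Lmax] - H[Lmax - 1] is used as h_mu.
    """
    h_L = discrete_derivative(H)   # h_L[k] is h_mu(k + 1) = H[k + 1] - H[k]
    h_mu = h_L[-1]
    E = sum(h - h_mu for h in h_L)                        # Eq. (10), truncated
    T = sum(E + h_mu * L - H[L] for L in range(len(H)))   # Eq. (11), truncated
    return h_mu, E, T
```

Applying `discrete_derivative` a second time to `h_L` gives the predictability gain $\Delta^2 H[X_0^L]$. For the period-2 table `[0, 1, 1, 1]` the estimates are $h_\mu = 0$, $\mathbf{E} = 1$, and $\mathbf{T} = 1$.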



III. PROCESS PRESENTATIONS 

A. The Causal State Representation 

Prediction is closely allied to the view of a process as a communication channel: We wish to predict the future using information from the past. At root, a prediction is probabilistic, specified by a distribution of possible futures $\overrightarrow{X}$ given a particular past $\overleftarrow{x}$: $\Pr(\overrightarrow{X} | \overleftarrow{x})$. At a minimum, a good predictive model needs to capture all of the information $I$ shared between the past and future: $\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}]$.

Consider now the goal of modeling: building a representation that allows not only good prediction but also expresses the mechanisms producing a system's behavior. To build a model of a structured process (a memoryful channel), computational mechanics [25] introduced an equivalence relation $\overleftarrow{x} \sim \overleftarrow{x}'$ that clusters all histories which give rise to the same prediction:

$$ \epsilon(\overleftarrow{x}) = \{ \overleftarrow{x}' : \Pr(\overrightarrow{X} | \overleftarrow{x}) = \Pr(\overrightarrow{X} | \overleftarrow{x}') \} . \tag{18} $$

In other words, for the purpose of forecasting the future, two different pasts are equivalent if they result in the same prediction. The result of applying this equivalence gives the process's causal states $\boldsymbol{\mathcal{S}} = \Pr(\overleftarrow{X}, \overrightarrow{X}) / \sim$, which partition the space $\overleftarrow{\mathbf{X}}$ of pasts into sets that are predictively equivalent. The set of causal states [26] can be discrete, fractal, or continuous; see, e.g., Figs. 7, 8, 10, and 17 in Ref. [27].

State-to-state transitions are denoted by matrices $T^{(x)}_{\mathcal{S}\mathcal{S}'}$, whose elements give the probability $\Pr(X = x, \mathcal{S}' | \mathcal{S})$ of transitioning from one state $\mathcal{S}$ to the next $\mathcal{S}'$ on seeing measurement $x$. The resulting model, consisting of the causal states and transitions, is called the process's ε-machine. Given a process $\mathcal{P}$, we denote its ε-machine by $M(\mathcal{P})$.

Causal states have the Markovian property that they render the past and future statistically independent; they shield the future from the past [28]:

$$ \Pr(\overleftarrow{X}, \overrightarrow{X} | \mathcal{S}) = \Pr(\overleftarrow{X} | \mathcal{S}) \Pr(\overrightarrow{X} | \mathcal{S}) . \tag{19} $$

Moreover, they are optimally predictive [28] in the sense that knowing which causal state a process is in is just as good as having the entire past: $\Pr(\overrightarrow{X} | \mathcal{S}) = \Pr(\overrightarrow{X} | \overleftarrow{X})$. In other words, causal shielding is equivalent to the fact [28] that the causal states capture all of the information shared between past and future: $I[\mathcal{S}; \overrightarrow{X}] = \mathbf{E}$.

ε-Machines have an important structural property
called unifilarity [551 [HI- From the start state, each 
symbol sequence corresponds to exactly one sequence of causal states [30]. The importance of unifilarity, as a property of any model, is reflected in the fact that representations without unifilarity, such as generic hidden Markov models, cannot be used to directly calculate important system properties, including the most basic, such as how random a process is. As a practical matter, unifilarity is easy to verify: For each state, each measurement symbol appears on at most one outgoing transition [31]. Thus, the signature of unifilarity is that on knowing the current state $\mathcal{S}_t$ and measurement $X_t$, the uncertainty in the next state $\mathcal{S}_{t+1}$ vanishes: $H[\mathcal{S}_{t+1} | \mathcal{S}_t, X_t] = 0$.
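The verification rule just stated (each symbol labels at most one outgoing transition per state) translates directly into a short check. The sketch below assumes a Mealy-type model stored as a dictionary of labeled matrices `T[x][i][j]`; that layout, the function name, and both example machines are hypothetical illustrations rather than anything defined in the text.

```python
def is_unifilar(T, tol=0.0):
    """True if, from every state, each symbol labels at most one outgoing edge.

    T[x][i][j] = Pr(emit x and go to state j | state i) for a Mealy HMM.
    """
    for matrix in T.values():
        for row in matrix:
            # More than one positive entry means symbol x leaves this state
            # along two different edges, so the next state is ambiguous.
            if sum(1 for p in row if p > tol) > 1:
                return False
    return True


# Hypothetical example: Golden Mean Process machine (unifilar).
GOLDEN_MEAN = {
    0: [[0.0, 0.5], [0.0, 0.0]],
    1: [[0.5, 0.0], [1.0, 0.0]],
}

# Hypothetical example: from state 0, symbol 1 can lead to state 0 or state 1.
NONUNIFILAR = {
    0: [[0.0, 0.0], [0.5, 0.0]],
    1: [[0.5, 0.5], [0.0, 0.5]],
}
```

Here `is_unifilar(GOLDEN_MEAN)` holds, while `is_unifilar(NONUNIFILAR)` does not.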

Out of all optimally predictive models $\mathcal{R}$, those for which $I[\mathcal{R}; \overrightarrow{X}] = \mathbf{E}$, the ε-machine captures the minimal amount of information that a process must store in order to communicate all of the excess entropy from the past to the future. This is the Shannon information contained in the causal states, the statistical complexity $C_\mu \equiv H[\mathcal{S}] \leq H[\mathcal{R}]$. It turns out that statistical complexity upper bounds the excess entropy [^ [53]: $\mathbf{E} \leq C_\mu$. In short, $\mathbf{E}$ is the effective information transmission rate of the process, viewed as a channel, and $C_\mu$ is the memory stored in that channel.

Combined, these properties mean that the ε-machine is the basis against which modeling should be compared, since it captures all of a process's information at maximum representational efficiency.

Importantly, due to unifilarity one can calculate the entropy rate directly from a process's ε-machine:

$$ h_\mu = H[X | \mathcal{S}] = - \sum_{\{\mathcal{S}\}} \Pr(\mathcal{S}) \sum_{\{x, \mathcal{S}'\}} T^{(x)}_{\mathcal{S}\mathcal{S}'} \log_2 \sum_{\{\mathcal{S}''\}} T^{(x)}_{\mathcal{S}\mathcal{S}''} . \tag{20} $$

$\Pr(\mathcal{S})$ is the asymptotic probability of the causal states, which is obtained as the normalized principal eigenvector of the transition matrix $T = \sum_{\{x\}} T^{(x)}$. A process's statistical complexity can also be directly calculated from its ε-machine:

$$ C_\mu = H[\mathcal{S}] = - \sum_{\{\mathcal{S}\}} \Pr(\mathcal{S}) \log_2 \Pr(\mathcal{S}) . \tag{21} $$

Thus, the ε-machine directly gives two important properties: a process's rate ($h_\mu$) of producing information and the amount ($C_\mu$) of historical information it stores in doing so. Moreover, Refs. [22, 23] showed how to calculate a process's excess entropy directly from the ε-machine.
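Equations (20) and (21) can be evaluated in a few lines once a unifilar presentation is written down. The sketch below is an illustration under assumptions: the example machine (Golden Mean Process), the `T[x][i][j]` layout, and the use of plain power iteration to find the stationary distribution (which presumes the state chain is aperiodic and irreducible) are all choices made here, not taken from the text.

```python
from math import log2

# Hypothetical example: Golden Mean Process machine, states A=0 and B=1.
# T[x][i][j] = Pr(emit x and go to state j | state i).
GOLDEN_MEAN = {
    0: [[0.0, 0.5], [0.0, 0.0]],
    1: [[0.5, 0.0], [1.0, 0.0]],
}


def stationary_distribution(T, iterations=2000):
    """Stationary distribution of the state-to-state matrix sum_x T[x].

    Uses power iteration from the uniform distribution; this assumes the
    chain is aperiodic and irreducible so that the iteration converges.
    """
    n = len(next(iter(T.values())))
    total = [[sum(T[x][i][j] for x in T) for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * total[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_rate(T, pi):
    """h_mu = H[X | S] in bits per symbol, Eq. (20); valid for unifilar T."""
    h = 0.0
    for i, p_state in enumerate(pi):
        for x in T:
            p_symbol = sum(T[x][i])  # Pr(x | state i)
            if p_symbol > 0.0:
                h -= p_state * p_symbol * log2(p_symbol)
    return h


def statistical_complexity(pi):
    """C_mu = H[S] in bits, Eq. (21); meaningful when the states are causal states."""
    return -sum(p * log2(p) for p in pi if p > 0.0)
```

For this example the stationary distribution is $(2/3, 1/3)$, giving $h_\mu = 2/3$ bits per symbol and $C_\mu = \log_2 3 - 2/3 \approx 0.918$ bits. The state entropy equals the statistical complexity only because the example machine is minimal; for a nonminimal unifilar presentation the same function returns the presentation state entropy instead.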

B. General Presentations 

The ε-machine is only one possible description of a process. There are many alternatives: Some larger, some smaller; some with the same prediction error, some with larger prediction error; some that are unifilar, some not; some that do an excellent job of capturing $\Pr(\overleftarrow{X}, \overrightarrow{X})$, many (or most) doing only an approximate job; some allowing for the direct calculation of the process's properties, some precluding such calculations.

The ε-machine, compared to all other possible descriptions, is arguably the best. The results in the following, as an ancillary benefit, strengthen this conclusion considerably. However, it is important to keep in mind that due to implementation constraints or intended use or under specified performance criteria, alternative models may be desirable and preferred to the ε-machine. Reference [27], for example, compares the benefits and disadvantages of different kinds of nonunifilar models that are smaller than the ε-machine. We return to elaborate on
this in Sec. IVnDl 

One refers to a process's possible descriptions as presentations [32]. Specifically, these are state-based models, using states and state transitions, that exactly describe $\Pr(\overleftarrow{X}, \overrightarrow{X})$. That is, given a finitary process $\mathcal{P}$, we consider the set of all presentations that generate the same process language: $\Pr(X^L)$, $L = 1, 2, \ldots$. The set of $\mathcal{P}$'s presentations is the focus of our work here. That is, we do not address models that give only approximations to the process language.

We refer to these alternative models as rivals. A rival consists of a set $\boldsymbol{\mathcal{R}}$ of states and state-to-state transitions $T^{(s)}_{\rho\rho'}$ over the symbols $s$ in the process's measurement alphabet $\mathcal{A}$. There is an associated mapping $\eta : \overleftarrow{x} \to \boldsymbol{\mathcal{R}}$ from pasts to rival states. When we refer to the rival's state as a random variable, we will denote this $\mathcal{R} = \eta(\overleftarrow{X})$. We use lower case $\rho$ when we refer to a particular realization: $\mathcal{R} = \rho$, $\rho \in \boldsymbol{\mathcal{R}}$. Just as with the ε-machine, given a rival presentation, we can refer to the amount of information the rival states contain; this is the presentation state entropy $H[\mathcal{R}]$.

Above, we noted that a process's ε-machine is its minimal unifilar presentation. But, how are the rivals related, if at all, to the ε-machine? To explore the organization of the space of rivals, in the following we relax properties that make the ε-machine unique, working with presentations that are nonminimal unifilar and those that are not even unifilar. And so, we must distinguish several kinds of presentation. First, we extend unifilarity to presentations, generally.

Definition 1. A presentation is unifilar if and only if

$$ H[\mathcal{R}_{t+1} | \mathcal{R}_t, X_t] = 0 . $$

Second, we introduce the notion of reverse-time unifilarity.

Definition 2. A presentation is counifilar if and only if

$$ H[\mathcal{R}_t | X_t, \mathcal{R}_{t+1}] = 0 . $$

Third, we will consider prescient presentations, those 
whose states are as good at predicting as the e-machine's 
causal states |28l ES . 



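Definition 3. A presentation is prescient if and only if for all pasts $\overleftarrow{x} \in \overleftarrow{\mathbf{X}}$:

$$ \Pr(\overrightarrow{X}^L | \mathcal{R} = \eta(\overleftarrow{x})) = \Pr(\overrightarrow{X}^L | \mathcal{S} = \epsilon(\overleftarrow{x})) , \tag{22} $$

for all $L = 1, 2, 3, \ldots$.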

We will also shortly discuss presentations to which one 
can or cannot synchronize — that are or are not control- 
lable. 

IV. STATE-BLOCK AND BLOCK-STATE 
ENTROPIES 



Now, we introduce two block entropies and discuss their properties, but first, we recall some well known results from information theory [21, Sec. 4.2].

For any stationary stochastic process, $\Delta H[X_0^L]$ is a nonincreasing sequence of nonnegative terms that converges, from above, to the entropy rate $h_\mu$. There is a complementary result which provides an estimate of the entropy rate that converges from below. It is typically stated in terms of the Moore (state-output) type of hidden Markov model [21, Thm. 4.5.1], so we recast the theorem in terms of the Mealy (edge-output) type of hidden Markov models, used exclusively here.

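Theorem 1. If $\mathcal{R}_0, \mathcal{R}_1, \ldots$ form a stationary Markov chain and $(X_i, \mathcal{R}_{i+1}) = \phi(\mathcal{R}_i)$, then

$$ H[X_L | \mathcal{R}_0, X_0^L] \leq h_\mu \leq H[X_L | X_0^L] , \tag{23} $$

$L = 0, 1, 2, \ldots$, and

$$ \lim_{L \to \infty} H[X_L | \mathcal{R}_0, X_0^L] = h_\mu . \tag{24} $$

$\phi$ need not be a deterministic mapping.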

Appendix B provides the proof details. Henceforth, we refer to $H[\mathcal{R}_0, X_0^L]$ as the state-block entropy.

We also define the block-state entropy to be $H[X_0^L, \mathcal{R}_L]$. As with the state-block entropy, there is a corresponding convergence result.

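Theorem 2. If $\mathcal{R}_0, \mathcal{R}_1, \ldots$ form a stationary Markov chain and $(X_i, \mathcal{R}_{i+1}) = \phi(\mathcal{R}_i)$, then

$$ H[X_0^L, \mathcal{R}_L] - H[X_0^{L-1}, \mathcal{R}_{L-1}] \leq h_\mu \leq H[X_L | X_0^L] , \tag{25} $$

$L = 1, 2, 3, \ldots$, and

$$ \lim_{L \to \infty} \left( H[X_0^L, \mathcal{R}_L] - H[X_0^{L-1}, \mathcal{R}_{L-1}] \right) = h_\mu . \tag{26} $$

Again, $\phi$ need not be a deterministic mapping.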

Ref. [33] provides the proof of this theorem and discusses related results in the context of crypticity and cryptic order [24].

Note, both of these theorems hold for general presentations, not just ε-machines, and this fact serves as the motivation for our later generalizations.

A. Convergence Hierarchies 

Just as with the block entropy $H[X_0^L]$, we will consider $L$-derivatives and integrals of the state-block and block-state entropies. At the first level,

$$ \Delta H[\mathcal{R}_0, X_0^L] \equiv H[\mathcal{R}_0, X_0^L] - H[\mathcal{R}_0, X_0^{L-1}] , \tag{27} $$
$$ \Delta H[X_0^L, \mathcal{R}_L] \equiv H[X_0^L, \mathcal{R}_L] - H[X_0^{L-1}, \mathcal{R}_{L-1}] . \tag{28} $$

Higher-order derivatives are defined similarly to Eq. (12). As before, the $n = 0$ case is an identity operator. So, for example, $\Delta^0 H[\mathcal{R}_0, X_0^L] = H[\mathcal{R}_0, X_0^L]$.

We already know from Thms. 1 and 2 that both of these quantities tend to $h_\mu$ in the large-$L$ limit, ensuring that all higher-order derivatives tend to zero.



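Now, consider the $n$th state-block and block-state integrals:

$$ \mathcal{K}_n \equiv \sum_{L=n}^{\infty} \left( \Delta^n H[\mathcal{R}_0, X_0^L] - \lim_{\ell \to \infty} \Delta^n H[\mathcal{R}_0, X_0^\ell] \right) , \tag{29} $$
$$ \mathcal{J}_n \equiv \sum_{L=n}^{\infty} \left( \Delta^n H[X_0^L, \mathcal{R}_L] - \lim_{\ell \to \infty} \Delta^n H[X_0^\ell, \mathcal{R}_\ell] \right) . \tag{30} $$

Note that both $\mathcal{K}_0 \geq 0$ and $\mathcal{J}_0 \geq 0$ while, in contrast, $\mathcal{I}_0 \leq 0$. Also, $\mathcal{K}_1 \leq 0$ and $\mathcal{J}_1 \leq 0$ while $\mathcal{I}_1 \geq 0$. These differences are due to the fact that the block entropy is concave in $L$ while the state-block and block-state entropies are convex.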

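Consider the partial sums of $\mathcal{K}_1$, the state-block integral:

$$ \mathcal{K}_1(L) = \sum_{\ell=1}^{L} \left( \Delta H[\mathcal{R}_0, X_0^\ell] - h_\mu \right) $$
$$ \phantom{\mathcal{K}_1(L)} = H[\mathcal{R}_0, X_0^L] - H[\mathcal{R}_0, X_0^0] - L h_\mu $$
$$ \phantom{\mathcal{K}_1(L)} = H[X_0^L | \mathcal{R}_0] - L h_\mu . \tag{31} $$

Note that if the presentation is unifilar, then $H[X_0^L | \mathcal{R}_0] = L h_\mu$ and $\mathcal{K}_1(L) = 0$. Thus, unifilarity is a sufficient condition for $\mathcal{K}_1 = 0$, but it is not a necessary condition.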

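Now, consider the partial sums of $\mathcal{J}_1$, the block-state integral:

$$ \mathcal{J}_1(L) = \sum_{\ell=1}^{L} \left( \Delta H[X_0^\ell, \mathcal{R}_\ell] - h_\mu \right) $$
$$ \phantom{\mathcal{J}_1(L)} = H[X_0^L, \mathcal{R}_L] - H[X_0^0, \mathcal{R}_0] - L h_\mu $$
$$ \phantom{\mathcal{J}_1(L)} = H[X_0^L, \mathcal{R}_L] - H[\mathcal{R}_L] - L h_\mu $$
$$ \phantom{\mathcal{J}_1(L)} = H[X_0^L | \mathcal{R}_L] - L h_\mu . \tag{32} $$

Similarly, if the presentation is counifilar, then it follows that $H[X_0^L | \mathcal{R}_L] = L h_\mu$ and $\mathcal{J}_1(L) = 0$. So, counifilarity is a sufficient condition for $\mathcal{J}_1 = 0$, but it is not a necessary condition.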

B. Asymptotics 

Theorems 1 and 2 tell us that $H[\mathcal{R}_0, X_0^L]$ and $H[X_0^L, \mathcal{R}_L]$ are convex functions in L and that the slope limits to the entropy rate. This means that each curve converges to a linear asymptote, cf. Eq. (5):

$$ H[\mathcal{R}_0, X_0^L] \sim Y_{\mathrm{SBE}} + h_\mu L , \tag{33} $$

$$ H[X_0^L, \mathcal{R}_L] \sim Y_{\mathrm{BSE}} + h_\mu L , \tag{34} $$

where $Y_{\mathrm{SBE}}$ and $Y_{\mathrm{BSE}}$ are constants independent of L. The pictures that one should have in mind for the growth of these entropies are those of Figs. 1, 5, 8, 11, and 14, which we will discuss in due course.

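In fact, we will take this behavior as the definition of the following linear asymptotes:

$$ \begin{aligned}
Y_{\mathrm{SBE}} &\equiv \lim_{L \to \infty} \left( H[\mathcal{R}_0, X_0^L] - h_\mu L \right) \\
&= \lim_{L \to \infty} \left( H[\mathcal{R}_0] + H[X_0^L | \mathcal{R}_0] - h_\mu L \right) \\
&= H[\mathcal{R}_0] + \mathcal{K}_1
\end{aligned} \tag{35} $$

and

$$ \begin{aligned}
Y_{\mathrm{BSE}} &\equiv \lim_{L \to \infty} \left( H[X_0^L, \mathcal{R}_L] - h_\mu L \right) \\
&= \lim_{L \to \infty} \left( H[\mathcal{R}_L] + H[X_0^L | \mathcal{R}_L] - h_\mu L \right) \\
&= H[\mathcal{R}_0] + \mathcal{J}_1 .
\end{aligned} \tag{36} $$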



These tell us that $\mathcal{K}_1$ and $\mathcal{J}_1$ are not the sublinear parts of the state-block and block-state entropies. This is in contrast to the corresponding result for the block entropies:

$$ Y_{\mathrm{BE}} \equiv \lim_{L \to \infty} \left( H[X_0^L] - h_\mu L \right) \tag{37} $$

$$ \phantom{Y_{\mathrm{BE}}} = \lim_{L \to \infty} \left( H[X_0^0] + H[X_0^L] - h_\mu L \right) \tag{38} $$

$$ \phantom{Y_{\mathrm{BE}}} = H[X_0^0] + \mathcal{I}_1 . \tag{39} $$

The term $H[X_0^0]$ was dropped in the earlier partial sum formulation, i.e., Eq. (10), since it corresponds to no measurement being made and so is zero. It is reintroduced above, though, to complete the formal parallel to the state-block and block-state entropy cases.

The result for block entropy is that the offset of the linear asymptote was equal to $\mathcal{I}_1$, the excess entropy. However, the argument just given clearly establishes that, in fact, one should think of the first derivatives as offsets from the initial value of their corresponding curves.

Finally, recall that $\mathcal{K}_1$ and $\mathcal{J}_1$ are not greater than zero, so $Y_{\mathrm{SBE}}$ and $Y_{\mathrm{BSE}}$ are less than or equal to the presentation state entropy $H[\mathcal{R}_0]$.

V. SYNCHRONIZATION 

A. Duality of Synchronization and Control 

Synchronization is a question about how an observer comes to know a process's (typically hidden) current internal state through observations. (Recall the picture introduced in Ref. [1].) As such, it requires a notion of state, either the process's causal state (using the ε-machine) or the state of some other presentation. In either case we monitor the observer's uncertainty over the states $\mathcal{R}$ after having seen a series of measurements $w = x_0 x_1 \cdots x_{L-1}$ using the conditional state entropy $H[\mathcal{R} | w]$. When this vanishes, the observer is synchronized and we call w a synchronizing word.

During synchronization, the observer updates her answer to the question, "Which presentation states can be reached by sequence w?" When there is a unique answer, the observer is synchronized. If the eventual answer, though, is only a proper subset of presentation states, then $0 < H[\mathcal{R} | w] < H[\mathcal{R}]$ and the observer can be said to be partially synchronized.
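The observer's bookkeeping described above amounts to a Bayesian update of a belief distribution over presentation states. The following sketch illustrates it for the two-state machine of the Even Process (discussed later in this section). The transition probability 1/2, the matrix layout `T`, and the function names are assumptions made for illustration only.

```python
from math import log2

# Even Process machine; the transition probability 1/2 is an assumption.
# T[x][i][j] = Pr(emit x, move to state j | state i); state 0 = A, state 1 = B.
T = {
    "0": [[0.5, 0.0], [0.0, 0.0]],
    "1": [[0.0, 0.5], [1.0, 0.0]],
}
PI = [2 / 3, 1 / 3]  # stationary state distribution


def update(belief, symbol):
    """One Bayesian update of the observer's belief over states."""
    M = T[symbol]
    joint = [sum(belief[i] * M[i][j] for i in range(len(belief)))
             for j in range(len(belief))]
    total = sum(joint)
    if total == 0:
        raise ValueError("word is not generated by the presentation")
    return [q / total for q in joint]


def state_uncertainty(word):
    """H[R | w] in bits, starting from the stationary distribution."""
    belief = PI
    for symbol in word:
        belief = update(belief, symbol)
    return -sum(q * log2(q) for q in belief if q > 0)


for w in ["", "1", "11", "0", "110"]:
    print(repr(w), round(state_uncertainty(w), 4))
```

In this example any word containing a 0 drives the uncertainty to zero, so it is a synchronizing word, whereas words consisting only of 1s leave the observer uncertain about the state.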

A formal treatment of synchronization appears in Refs. [34, 35], which define asymptotic synchronization as follows.

Definition 4. A presentation is weakly asymptotically synchronizing if and only if $\lim_{L \to \infty} H[\mathcal{R}_L | X_0^L] = 0$.

While some processes can have synchronizing words, others have synchronizing blocks where every word of a finite length R is a synchronizing word. Such processes are called Markov processes. The smallest such R is the Markov order PH |3S]. It turns out that the ε-machine presentation for a Markov process is exactly synchronizing PI: for finite R, $H[\mathcal{S}_0 | X_{-L}^L] = 0$, $L \geq R$.

If a process admits a presentation that is only weakly asymptotically synchronizing, though, then an observer will be in various conditions of state uncertainty until the limit $L \to \infty$. Finitary ε-machines, as it turns out, are always weakly asymptotically synchronizing and the state uncertainty vanishes exponentially fast [35]: $\Pr(H[\mathcal{S}_0 | X_{-L}^L] > 0) \propto e^{-L}$.

The controllability properties of a process and its mod- 
els are analogous. However, now there is a designer that 
has built an implementation of a process. And, starting 
from an unknown condition, the designer wishes to pre- 
pare the process in a particular state or set of states by 
imposing a sequence of inputs. Phrased this way, one sees 
that the implementation is, in effect, a presentation and 
the control sequence is none other than a synchronizing 
word. Due to this duality, we only discuss synchroniza- 
tion in the bulk of our development, returning at the 
end to briefly draw out interpretations of the results for 
controllability. 

B. Synchronizing to the e-Machine 

We noted that the ε-machine directly gives two important information-theoretic properties, the entropy rate ($h_\mu$) and the statistical complexity ($C_\mu$), and one (the excess entropy $\mathbf{E}$) indirectly. The difference between $C_\mu$ and $\mathbf{E}$ was introduced as the crypticity [531 US]

$$ \chi \equiv C_\mu - \mathbf{E} \tag{40} $$

to describe how much of the internal state information ($C_\mu$) is not locally present in observed sequences ($\mathbf{E}$).

Synchronization, as we discussed, is a property of the recurrent portion of the ε-machine and since it is unifilar, if one knows its current state and follows transitions according to the word being considered, then one will always know the ε-machine's final state. However, it is also useful to consider the scenario when one does not know the ε-machine's current state. Given no other information, the best estimate for the current state is to draw from the stationary state distribution $\Pr(\mathcal{S})$. Then, as each symbol is observed, one updates this belief distribution and estimates the next state from this updated distribution.

As noted above, $H[\mathcal{S}_L | X_0^L]$ converges to zero exponentially fast for all ε-machines with a finite number of recurrent causal states. At each L before that point, there is an uncertainty in the causal state of the ε-machine. If we add up the uncertainty at each length, then we have the synchronization information:



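$$ \mathbf{S} \equiv \sum_{L=0}^{\infty} H[\mathcal{S}_L | X_0^L] \tag{41} $$

$$ \phantom{\mathbf{S}} = \sum_{L=0}^{\infty} \left( H[X_0^L, \mathcal{S}_L] - H[X_0^L] \right) . \tag{42} $$

[Figure 1: entropy curves plotted against block length L [symbols]. The plot marks $C_\mu$, $\mathbf{E}$, the block entropy $H[X_0^L]$, the asymptote $\mathbf{E} + L h_\mu$, the region $\mathbf{S}$, and the Markov order R.]

FIG. 1. Block entropy and block-state entropy growth for a generic finitary stationary process: It is easily seen that the synchronization information upper bounds the transient information, $\mathbf{T} \leq \mathbf{S}$, as $\mathbf{T}$ is a component of $\mathbf{S}$. The Markov order R and cryptic order k are also shown in their proper relationship $k \leq R$: R indicates where the block entropy meets the $\mathbf{E} + h_\mu L$ asymptote and k, where the block-state entropy meets the same asymptote.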



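Importantly, the second line shows that synchronization information can be visualized as the sum of all differences between the block-state and the block entropy curves.

Moreover, starting from Eq. (42) we find

$$ \mathbf{S} = \sum_{L=0}^{\infty} \left[ H[X_0^L, \mathcal{S}_L] - (\mathbf{E} + L h_\mu) \right] + \sum_{L=0}^{\infty} \left[ (\mathbf{E} + L h_\mu) - H[X_0^L] \right] \tag{43} $$

$$ \phantom{\mathbf{S}} = \mathcal{J}_0 - \mathcal{I}_0 . \tag{44} $$

We know that $\mathbf{T} = -\mathcal{I}_0$. When we identify $\mathcal{J}_0$ with a separate, nonnegative information quantity we conclude immediately that $\mathbf{S} \geq \mathbf{T}$. This relationship is shown graphically in Fig. 1.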

The cryptic order k, as defined in Ref. [24], can be interpreted as the length at which the block-state curve has converged to its asymptote: $\mathbf{E} + h_\mu L$. Surprisingly, this is not the length at which an ε-machine can be considered synchronized, which is given by the Markov order R. Given its definition as the smallest value L for which $H[\mathcal{S}_L | \overrightarrow{X}_0] = 0$, we see that the cryptic order can be interpreted as a measure of how far back in time the state sequence can be retrodicted from the distant future.

For example, the Even Process consists of all bi-infinite sequences that contain even-length stretches of 1s separated by at least a single 0; see Ref. [12]. This process cannot be considered synchronized at any finite length because all the thus-far seen symbols may be 1s, and so one does not know if the latest symbol is a 1 at an even- or odd-valued location. In contrast, once a 0 has been seen, we know instantly the evenness and oddness of each preceding 1, making the cryptic order k = 0. Since the cryptic order k = 0 for the Even Process, one concludes that $\mathcal{J}_0$ does not contribute to $\mathbf{S}$ and $\mathbf{T} = \mathbf{S}$.
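As a rough numerical illustration, the synchronization information of Eq. (41) can be approximated by truncating the sum and tracking the observer's belief distributions rather than individual words. The sketch below does this for the Even Process under the assumption that its two-state machine uses transition probability 1/2. The matrix layout `T`, the helper names, and the truncation length are illustrative choices, not the authors' method.

```python
from fractions import Fraction
from math import log2

HALF = Fraction(1, 2)
# Even Process machine (assumed transition probability 1/2).
# T[x][i][j] = Pr(emit x, go to state j | state i); state 0 = A, state 1 = B.
T = {
    "0": [[HALF, 0], [0, 0]],
    "1": [[0, HALF], [1, 0]],
}
PI = (Fraction(2, 3), Fraction(1, 3))  # stationary state distribution


def entropy(dist):
    """Shannon entropy in bits of a distribution given as Fractions."""
    return -sum(float(p) * log2(float(p)) for p in dist if p > 0)


def synchronization_information(T, pi, max_length):
    """Truncated sum over L = 0..max_length of H[S_L | X_0^L]."""
    n = len(pi)
    # Map: belief over states -> total probability of words producing it.
    beliefs = {tuple(pi): Fraction(1)}
    total = 0.0
    for _ in range(max_length + 1):
        total += sum(float(p) * entropy(b) for b, p in beliefs.items())
        updated = {}
        for belief, p in beliefs.items():
            for M in T.values():
                joint = [sum(belief[i] * M[i][j] for i in range(n))
                         for j in range(n)]
                q = sum(joint)
                if q > 0:
                    post = tuple(v / q for v in joint)
                    updated[post] = updated.get(post, 0) + p * q
        beliefs = updated
    return total


print(synchronization_information(T, PI, 120), 2 * log2(3))
```

Under these assumptions the truncated sum settles near $2 \log_2 3 \approx 3.17$ bits. The terms decay geometrically because only the all-1s words leave the observer unsynchronized.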

The two pieces, $\mathcal{J}_0$ and $\mathcal{I}_0$, comprising $\mathbf{S}$ are both finite due to the exponentially fast convergence of the two block-entropy curves [35]. This shows that $\mathbf{S}$ consists of distinct information contributions drawn from different process features. Referring to Fig. 1, the lower piece, the transient information $\mathbf{T}$, is information recorded due to an over-estimation of the entropy rate $h_\mu$ at block lengths L less than the Markov order R. This over-estimation is due, in effect, to L being shorter than the longest correlations in the data. In a complementary way, the upper portion $\mathcal{J}_0$ can be viewed as the amount of state information that cannot be retrodicted, even given the infinite future.

The relative roles of the contributions to synchronization information can be clearly seen for one-dimensional range-R spin systems. Reference [12] claimed that for spin chains:

$$ \mathbf{S} = \mathbf{T} + \frac{1}{2} R(R+1) h_\mu , \tag{45} $$



where R is the coupling range (Markov order) of the spin chain. This can be established rather directly, and understood for the first time, using the geometric convergence picture just introduced for $\mathbf{S}$. First, Ref. [53] showed that for a spin chain $H[X_0^L, \mathcal{S}_L]$ is flat (zero slope) for $0 \leq L \leq R$, after which it converges to its asymptote. Second, combining these, we have:

$$ \mathcal{J}_0 = \sum_{L=0}^{R} \left[ H[X_0^L, \mathcal{S}_L] - (\mathbf{E} + L h_\mu) \right] \tag{46} $$

$$ \phantom{\mathcal{J}_0} = \sum_{L=0}^{R} \left[ (\mathbf{E} + R h_\mu) - (\mathbf{E} + L h_\mu) \right] \tag{47} $$

$$ \phantom{\mathcal{J}_0} = \sum_{L=0}^{R} (R - L) h_\mu \tag{48} $$

$$ \phantom{\mathcal{J}_0} = \frac{1}{2} R(R+1) h_\mu . \tag{49} $$

So, the amount of state information that cannot be retrodicted is quadratic in Markov order.
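The last step of this calculation is just the arithmetic-series identity $\sum_{L=0}^{R} (R - L) = R(R+1)/2$. A short exact check, with a hypothetical helper name chosen for illustration:

```python
from fractions import Fraction


def j0_in_units_of_h_mu(R):
    """Sum_{L=0}^{R} (R - L), i.e. J_0 / h_mu as in Eq. (48)."""
    return sum(Fraction(R - L) for L in range(R + 1))


for R in range(6):
    # Left: the explicit sum. Right: the closed form R(R+1)/2.
    print(R, j0_in_units_of_h_mu(R), Fraction(R * (R + 1), 2))
```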

Finally, $H[X_0^L]$ and $H[X_0^L, \mathcal{S}_L]$ give lower and upper bounds on $\mathbf{E}$, respectively: the first monotonically approaches $\mathbf{E} + L h_\mu$ from below and the second monotonically approaches it from above. This way, given an ε-machine, it is simple to compute $\mathbf{E}$ with any accuracy required from the block entropies, which themselves can be efficiently estimated from the ε-machine. Similarly, since $\Delta H[X_0^L]$ over-estimates the entropy rate while approaching from above and $\Delta H[X_0^L, \mathcal{S}_L]$ under-estimates the entropy rate while approaching from below, one obtains an analogous pair of bounds on $h_\mu$. This block-state technique for bounding the entropy rate, however, holds for any type of presentation of the process. (Cf. Ref. [51, Sec. 4.5].)
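The bracketing of $h_\mu$ described above can be sketched numerically. The example below uses a two-state Golden-Mean-style machine with transition probability 1/2. The machine, its matrix layout `T`, and the function names are assumptions for illustration. It computes the block and block-state entropies by enumerating words and compares their discrete derivatives to the entropy rate.

```python
from itertools import product
from math import log2

# Golden-Mean-style machine (assumed): T[x][i][j] = Pr(emit x, go to j | i).
T = {
    "0": [[0.0, 0.5], [0.0, 0.0]],
    "1": [[0.5, 0.0], [1.0, 0.0]],
}
PI = [2 / 3, 1 / 3]   # stationary state distribution
H_MU = 2 / 3          # entropy rate of this process, bits per symbol


def forward(word):
    """Vector of joint probabilities Pr(X_0^L = word, S_L = j)."""
    v = PI
    for symbol in word:
        M = T[symbol]
        v = [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(v))]
    return v


def shannon(probs):
    return -sum(p * log2(p) for p in probs if p > 0)


def block_entropy(L):
    """H[X_0^L]."""
    return shannon([sum(forward(w)) for w in product("01", repeat=L)])


def block_state_entropy(L):
    """H[X_0^L, S_L]."""
    return shannon([p for w in product("01", repeat=L) for p in forward(w)])


def entropy_rate_bounds(L):
    """(lower, upper) bounds on h_mu from the two entropy curves at length L."""
    lower = block_state_entropy(L) - block_state_entropy(L - 1)
    upper = block_entropy(L) - block_entropy(L - 1)
    return lower, upper


for L in range(1, 6):
    print(L, entropy_rate_bounds(L))
```

For this order-1 Markov example both bounds coincide with $h_\mu$ from L = 2 onward. Processes with longer memory would show a slower approach.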

VI. PRESENTATION QUANTIFIERS 



The development and results have focused, so far, on ε-machines and their information-theoretic properties. Due to the ε-machine's uniqueness, these were also properties of the corresponding processes themselves. Now, we relax the defining characteristics of ε-machines to consider generic presentations. Naturally, this destroys our ability to directly identify presentation properties with those of the process represented. A process's entropy rate ($h_\mu$) and excess entropy ($\mathbf{E}$) remain unchanged, however, since they are defined solely through its observables $\Pr(\overleftarrow{X}, \overrightarrow{X})$. Widening our purview to generic presentations leads us to briefly introduce several new properties that capture information processing in presentations. Perhaps more distinctly, this also leads us to quantify the kinds of information in a presentation that are not characteristics of the process it represents. Section VII then provides more detailed expositions on their meaning and example processes to illustrate them.
A. Crypticity 

The statistical complexity $C_\mu$ is the amount of information a process must store in order to generate future behavior. The crypticity $\chi$ is that part of $C_\mu$ not transmitted to the future: $\chi = C_\mu - \mathbf{E}$. Roughly, it can be thought of as the irreducible overhead that arises from the process's causal structure. Reference [22] defined crypticity for ε-machines as $\chi = H[\mathcal{S}_0 | \overrightarrow{X}_0]$. Now, we generalize this to define crypticity for generic presentations.

Definition 5. The presentation crypticity $\chi(\mathcal{R})$ is the amount of state information shared with the past that is not transmitted to the future:

$$ \chi = I[\overleftarrow{X}_0; \mathcal{R}_0 | \overrightarrow{X}_0] . \tag{50} $$

When the presentation states are causal states, this quantity reduces to the original definition, the process's crypticity. Furthermore, the crypticity is the difference between the presentation state entropy and the y-intercept of the block-state entropy curve, Eq. (34).



Theorem 3. The presentation crypticity $\chi(\mathcal{R})$ is the difference between the presentation state entropy $H[\mathcal{R}_0]$ and the sublinear part of the block-state entropy:

$$ \chi = -\mathcal{J}_1 . \tag{51} $$



Proof. Starting with the length-L approximation of the crypticity, we work our way to the $L$th partial sum of $-\mathcal{J}_1$ via a straightforward calculation:

$$ \begin{aligned}
I[X_{-L}^L; \mathcal{R}_0 | X_0^L]
&= H[X_{-L}^L | X_0^L] - H[X_{-L}^L | \mathcal{R}_0, X_0^L] && (52) \\
&= H[X_{-L}^L | X_0^L] - H[X_{-L}^L | \mathcal{R}_0] && (53) \\
&= L h_\mu - H[X_{-L}^L | \mathcal{R}_0] + H[X_{-L}^L | X_0^L] - L h_\mu && (54) \\
&= -\mathcal{J}_1(L) + H[X_{-L}^L | X_0^L] - L h_\mu && (55) \\
&= -\mathcal{J}_1(L) + H[X_0^L | X_L^L] - L h_\mu && (56) \\
&= -\mathcal{J}_1(L) + \sum_{j=0}^{L-1} H[X_j | X_{j+1}^{L-j-1}, X_L^L] - L h_\mu && (57) \\
&= -\mathcal{J}_1(L) + \sum_{j=L}^{2L-1} H[X_0 | X_1^{j}] - L h_\mu . && (58)
\end{aligned} $$

Equation (53) follows because the states (in any hidden Markov model) shield the past from the future: given the state, the past and the future are conditionally independent. Equation (55) follows from the definition of $\mathcal{J}_1$, and Eq. (56) from stationarity. Equation (57) follows from the chain rule for block entropies [12], and Eq. (58) from using stationarity again.

Finally, we take the large-L limit. By definition, we have $\mathcal{J}_1(L) \to \mathcal{J}_1$. The remaining difference converges to zero due to a result in Ref. [35] that the conditional block entropies converge to the entropy rate faster than linearly in L. □



B. Oracular Information 

We now introduce a sibling of crypticity — the oracular 
information. 

Definition 6. The oracular information is the amount of state information shared with the future that is not derived from the past:

$$ \zeta = I[\mathcal{R}_0; \overrightarrow{X}_0 | \overleftarrow{X}_0] . \tag{59} $$



This new quantity is always zero for the ε-machine and nonzero only for nonunifilar presentations. It is the difference between presentation statistical complexity and the y-intercept of the state-block entropy curve, Eq. (33).

Theorem 4. The oracular information is the difference between the presentation state entropy $H[\mathcal{R}_0]$ and the sublinear part of the state-block entropy curve:

$$ \zeta = -\mathcal{K}_1 . \tag{60} $$



Proof. The proof proceeds almost identically to the corresponding result for crypticity. Namely,

$$ I[\mathcal{R}_0; X_0^L | X_{-L}^L] = -\mathcal{K}_1(L) + \sum_{j=L}^{2L-1} H[X_j | X_0^j] - L h_\mu . \tag{61} $$

Then, taking the large-L limit proves the result. □



In this sense, a positive oracular information indicates 
that there is a deficit in using only the rival states for pre- 
diction. More information — the oracular information — 
must be extracted from the presentation in order to per- 
form optimal prediction. 

C. Gauge Information 

When moving away from the optimal representation afforded by a process's ε-machine, it is possible to encounter presentations containing state information that is not justified by a process's bi-infinite set of observables. We call this gauge information to draw a parallel with the descriptional degrees of freedom that gauge theory addresses in physical systems [37].



Definition 7. The gauge information is the uncertainty in the presentation states given the entire past and future:

$$ \varphi = H[\mathcal{R}_0 | \overleftarrow{X}_0, \overrightarrow{X}_0] . \tag{62} $$



That is, to the extent there is uncertainty in the states, 
even after the past and the future are known, the presen- 
tation contains state uncertainty above and beyond the 
process. Thus, there are components of the model that 
are not determined by the process; rather they are the 
result of a choice of presentation. 

Intuitively, gauge information can be related to the to- 
tal state entropy, crypticity, oracular information, and 
excess entropy. Later, we will discuss information dia- 
grams as a useful visualization tool, but for now, we sim- 
ply point out that one can "visually" verify the following 
theorem from Figure [TSJ 

Theorem 5. Gauge information is the difference between the state entropy and the sum of the crypticity, oracular information, and excess entropy:

$$ \varphi = H[\mathcal{R}] - (\chi + \zeta + \mathbf{E}) . \tag{63} $$



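Proof. Since we are working with hidden Markov models, the future and past are conditionally independent given the current state. Thus, $\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}] = I[\overleftarrow{X}; \mathcal{R}; \overrightarrow{X}]$. Now, the proof proceeds as a simple verification:

$$ \begin{aligned}
\chi(L) + \zeta(L) + \mathbf{E}(L) &= I[X_{-L}^L, X_0^L; \mathcal{R}] \\
&= H[\mathcal{R}] - H[\mathcal{R} | X_{-L}^L, X_0^L] .
\end{aligned} $$

So, finite-length approximations to the gauge information can be written as:

$$ H[\mathcal{R}] - \left( \chi(L) + \zeta(L) + \mathbf{E}(L) \right) = H[\mathcal{R} | X_{-L}^L, X_0^L] . $$

Taking the limit, we achieve our desired result. □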

D. Synchronization Information 

As we noted, it is always possible to asymptotically synchronize to an ε-machine with a finite number of recurrent causal states. For some processes, synchronization can happen in finite time, while in others it can only happen in the limit as the observation window tends to infinity. In either case, it is always true that $H[\mathcal{S}_0 | \overleftarrow{X}_0] = 0$.

When we generalize to presentations that differ from ε-machines, it is no longer true that one always synchronizes to the presentation states. In such cases, there is irreducible state uncertainty, even after observing an infinite number of symbols. This kind of state uncertainty cannot be reduced by past observations alone. Due to this, the synchronization information, as previously defined, diverges.

Definition 8. The presentation synchronization information is the total uncertainty in the presentation states:

$$ \mathbf{S} \equiv \sum_{L=0}^{\infty} H[\mathcal{R}_L | X_0^L] . \tag{64} $$

We will show in Sec. VII that this can be understood in terms of the gauge and oracular informations.

E. Cryptic Order 

The cryptic order was defined in Ref. [24] as the minimum length k for which $H[\mathcal{S}_k | \overrightarrow{X}_0] = 0$. Reference [36] shows that the cryptic order is a topological property of the irreducible sofic shift [32] describing the support of the ε-machine. However, we can understand the cryptic order geometrically as the length $k_\chi$ at which the block-state entropy $H[X_0^L, \mathcal{S}_L]$ reaches its asymptote; see Eq. (34). It turns out that this concept generalizes directly to generic presentations.

Definition 9. The presentation cryptic order is the length $k_\chi$ at which the block-state entropy curve reaches its asymptote:

$$ k_\chi \equiv \min \{ L : H[X_0^L, \mathcal{R}_L] = H[\mathcal{R}_0] - \chi + h_\mu L \} . \tag{65} $$

One would like to understand the cryptic order in terms of an explicit limit, as done for ε-machines, where cryptic order is the minimum k for which $H[\mathcal{S}_k | \overrightarrow{X}_0] = 0$. The obvious complication for presentations, in general, is that one might never synchronize to a particular state. However, it turns out that one can understand the presentation cryptic order in terms of one's uncertainty in the distribution over distributions of states, that is, the uncertainty in the distribution over mixed states [23, 38]. Specifically, we frame the generalized cryptic order in terms of synchronizing to distributions over presentation states. We outline the approach briefly; a detailed exposition will appear elsewhere [55].

As measurements are made, an observer's uncertainty in the state of the presentation varies. However, the pattern of variation becomes regular as more observations are made. The cryptic order, then, is understood as the number of distributions over presentation states that one cannot know with certainty from time t = 0 given the entire future. Said differently, the cryptic order is the time at which an observer becomes absolutely certain about the uncertainty in the presentation states.



F. Oracular Order 

The oracular order definition parallels those of the 
cryptic and the Markov orders. 

Definition 10. The oracular order is the length $k_\zeta$ at which the state-block entropy curve reaches its asymptote:

$$ k_\zeta \equiv \min \{ L : H[\mathcal{R}_0, X_0^L] = H[\mathcal{R}_0] - \zeta + h_\mu L \} . \tag{66} $$

It always vanishes for e-machines. So, this new length 
scale is a property of the presentation only and not of the 
process generated by the presentation. 

G. Gauge Order 

The gauge order definition also parallels those of the 
cryptic, Markov, and oracular orders. 

Definition 11. The gauge order is the length $k_\varphi$ at which $H[\mathcal{R}_0 | X_{-L}^L, X_0^L]$ reaches its asymptote:

$$ k_\varphi \equiv \min \{ L : H[\mathcal{R}_0 | X_{-L}^L, X_0^L] = \varphi \} . \tag{67} $$

Geometrically, we visualize the gauge order as the length at which the difference between two curves, $H[X_{-L}^L, \mathcal{R}_0, X_0^L]$ and $H[X_{-L}^L, X_0^L]$, becomes fixed to their asymptotic difference.

Theorem 6. The gauge order is the maximum of the Markov, cryptic, and oracular orders:

$$ k_\varphi = \max \{ R, k_\chi, k_\zeta \} . \tag{68} $$

Proof. The gauge information can be understood as the left-over state information after the excess entropy, crypticity, and oracular information [39] have been extracted:

$$ \varphi = H[\mathcal{R}_0] - \mathbf{E} - \chi - \zeta . \tag{69} $$

Thus, as soon as the observer reaches each of the Markov, cryptic, and oracular orders, the remaining state information exactly equals the gauge information. □

It is important to note that, unlike the Markov, cryp- 
tic, and oracular orders, the gauge order does not indi- 
cate a scale at which an amount of information is con- 
tained. Rather, it is more the opposite. The gauge order 
is the length scale beyond which there is no point at- 
tempting to extract any more state information (even 
with an oracle), precisely because this remainder is the 
gauge information and, therefore, not correlated with the 
process language. It corresponds to what in physics one 
calls a gauge freedom. 
H. Synchronization Order

As mentioned in Sec. V B, the length at which an observer has synchronized to an ε-machine is always R, the Markov order. Recall, any order-R Markov process has $I[\overleftarrow{X}_{-R}; \overrightarrow{X}_0 | X_{-R}^R] = 0$. Synchronization to the ε-machine requires that $H[\mathcal{S}_L | X_0^L] = 0$, and it is straightforward to see that this holds for L = R. As we generalize to non-ε-machine presentations, though, we must look beyond Markov order to address the fact that one might only synchronize to distributions over presentation states.

Definition 12. The presentation synchronization order is the length $k_{\mathbf{S}}$ at which $H[\mathcal{R}_L | X_0^L]$ reaches its asymptote:

$$ k_{\mathbf{S}} \equiv \min \{ L : H[\mathcal{R}_L | X_0^L] = \varphi + \zeta \} . \tag{70} $$



The motivation for this definition is that the asymptote is simply the difference of the asymptotes for the block-state and block entropy curves. That is, the synchronization order is also thought of as the length at which the state uncertainty equals its irreducible state uncertainty: $\varphi + \zeta = H[\mathcal{R}_0 | \overleftarrow{X}_0]$.

Now, we show that the synchronization order must occur at either the presentation cryptic order or the Markov order.

Theorem 7. The presentation synchronization order is the maximum of the Markov and presentation cryptic orders:

$$ k_{\mathbf{S}} = \max \{ R, k_\chi \} . \tag{71} $$

Proof. When both the block-state and block entropy curves have reached their asymptotes the observer will have extracted $\mathbf{E} + \chi$ bits of state information. This leaves $H[\mathcal{R}_0] - \mathbf{E} - \chi = \varphi + \zeta$ bits. This is exactly the irreducible state uncertainty, that which cannot be learned from the observables. □



Note that for ε-machines: $\mathbf{E} + \chi = C_\mu$. So, when an observer has extracted all that can be learned about the process from the past observables, the observer has learned everything about the causal states.

When the synchronization order is finite, $H[\mathcal{R}_L | X_0^L]$ is fixed at the presentation's irreducible state uncertainty for all $L \geq k_{\mathbf{S}}$. Then, it can be helpful to view the presentation synchronization information as consisting of two contributions:

$$ \mathbf{S} = \sum_{L=0}^{k_{\mathbf{S}} - 1} H[\mathcal{R}_L | X_0^L] + \sum_{L=k_{\mathbf{S}}}^{\infty} (\varphi + \zeta) . \tag{72} $$

When the synchronization order is not finite, it can be useful to interpret the synchronization information in a slightly different manner:

$$ \mathbf{S} = -\mathcal{I}_0 + \mathcal{J}_0 + \lim_{L \to \infty} (\varphi + \zeta) L . \tag{73} $$



I. Synchronization Time 



Reference [13] defined the synchronization time $\tau$ of a periodic process to be the average time needed to synchronize to the states. Let $w = w_0 \cdots w_{p-1}$ be a cyclic permutation of the word that is repeated by a periodic process having period p. It follows that

$$ \Pr(X_0^p = w) = \frac{1}{p} , \tag{74} $$

since any cyclic permutation is just as likely as another. Now, while each permutation has the same probability, it is not true that each permutation is equally informative in terms of synchronization. For example, consider the process that repeats the word 00011, indefinitely. If an observer saw 01, then the observer would be synchronized. In contrast, the observer would not be synchronized if 00 had been observed instead. Reference [13] defined $\tau_w$ as the synchronization time of the cyclic permutation w. Then,

$$ \tau \equiv \sum_{w} \tau_w \Pr(X_0^p = w) . \tag{75} $$

Since $h_\mu = 0$ for all periodic processes,

$$ \Pr(X_0^p = w) = \Pr(X_0^{\tau_w} = w_0 \cdots w_{\tau_w - 1}) . \tag{76} $$

Thus, we can rewrite $\tau$ suggestively as:

$$ \tau = \sum_{w} \tau_w \Pr(X_0^{\tau_w} = w_0 \cdots w_{\tau_w - 1}) . \tag{77} $$

Then, instead of summing over all cyclic permutations of w, we can just sum over the set $\mathcal{L}_{\mathrm{sync}}$ of all minimal synchronizing words. (A word is a minimal synchronizing word if no prefix of the word is also synchronizing.) Now, we can extend $\tau$ to all finitary processes, not just periodic ones.
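For the periodic case, the averaging over cyclic permutations can be sketched in a few lines. The code below reads $\tau_w$ as the length of the shortest prefix of the rotation w that occurs at only one phase of the cycle. That reading, the function name, and the requirement that the input word be primitive (not itself a repetition of a shorter word) are assumptions of this illustration.

```python
from fractions import Fraction


def periodic_sync_time(word):
    """Average, over the p equally likely phases, of the number of symbols
    an observer must see before the phase is uniquely determined.
    Assumes `word` is primitive, so every rotation is distinct."""
    p = len(word)
    doubled = word + word
    rotations = [doubled[i:i + p] for i in range(p)]
    total = 0
    for rot in rotations:
        for n in range(1, p + 1):
            prefix = rot[:n]
            # Synchronized once only one phase is consistent with the prefix.
            if sum(r.startswith(prefix) for r in rotations) == 1:
                total += n
                break
    return Fraction(total, p)


print(periodic_sync_time("00011"))
```

For the word 00011 the rotations beginning 01, 11, and 10 synchronize after two symbols, while those beginning 00 need a third, which matches the observation that 01 synchronizes and 00 does not.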

Definition 13. The process synchronization time is the average time required to synchronize to the ε-machine's recurrent causal states:

$$ \tau \equiv \sum_{w \in \mathcal{L}_{\mathrm{sync}}} |w| \Pr(X_0^{|w|} = w) . \tag{78} $$

Note that any order-R Markov process has $\tau \leq R$. The synchronization time gives an intuition for how long it takes to synchronize to a stochastic process.

As an example, recall the Even Process [12]. It has the property that there are arbitrarily long minimal synchronizing words. For example, $1^k 0$ is always a minimal synchronizing word, for any k. Despite this fact, the synchronization time of the Even Process is $\tau = 10/3$. After repeatedly observing sequences four symbols in length, on average an observer will be synchronized to the states of the ε-machine.
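The value $\tau = 10/3$ can be reproduced with a short calculation that accumulates length times probability over minimal synchronizing words. The sketch assumes the Even Process machine uses transition probability 1/2. The matrix layout `T`, the function name, and the truncation length are illustrative choices.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
# Even Process machine (assumed transition probability 1/2).
# T[x][i][j] = Pr(emit x, go to state j | state i); state 0 = A, state 1 = B.
T = {
    "0": [[HALF, 0], [0, 0]],
    "1": [[0, HALF], [1, 0]],
}
PI = (Fraction(2, 3), Fraction(1, 3))  # stationary state distribution


def synchronization_time(T, pi, max_length):
    """Sum of |w| * Pr(w) over minimal synchronizing words up to max_length."""
    n = len(pi)
    # Beliefs that are not yet synchronized, with their total probability.
    unsynced = {tuple(pi): Fraction(1)}
    tau = Fraction(0)
    for length in range(1, max_length + 1):
        still_unsynced = {}
        for belief, p in unsynced.items():
            for M in T.values():
                joint = [sum(belief[i] * M[i][j] for i in range(n))
                         for j in range(n)]
                q = sum(joint)
                if q == 0:
                    continue  # this symbol cannot follow the words so far
                if sum(1 for v in joint if v > 0) == 1:
                    tau += length * p * q  # first moment of synchronization
                else:
                    post = tuple(v / q for v in joint)
                    still_unsynced[post] = still_unsynced.get(post, 0) + p * q
        unsynced = still_unsynced
    return tau


print(float(synchronization_time(T, PI, 120)))
```

The truncation at 120 symbols is harmless here because the probability of remaining unsynchronized halves every two symbols.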

When considering more general presentations it is not always the case that one can synchronize to the states, as $\tau$ can be infinite. Just as with the cryptic order, however, one can synchronize to distributions over the presentation states. This motivates the presentation synchronization time.

Definition 14. The presentation synchronization time 
is the average time required to synchronize to a recurrent 
distribution over presentation states. 

We provide an intuitive definition here, leaving a more 
detailed discussion, where notation is properly developed, 
for a sequel. 

VII. CLASSIFYING PRESENTATIONS 

The ε-machine is frequently the preferred presentation of a process, especially when one is interested in understanding fundamental properties of the process itself. However, one might be interested in the properties of particular presentations of a process, and it would be helpful to have a theory analogous to that which has been developed for ε-machines.

To develop this, we establish a classification of a 
process's presentations. The classes are defined in 
terms of whether a presentation is nonunifilar, unifi- 
lar, weakly asymptotically synchronizable, and minimal 
unifilar. The result is shown in Fig. [2] which shows that 
the presentation classes form a nested hierarchy. 

The most general type of presentation is nonunifilar, where we allow the possibility that $H[\mathcal{R}_1 | \mathcal{R}_0, X_0] > 0$. Then, unifilar presentations are the subset of nonunifilar presentations for which this quantity is exactly zero. In
the unifilar class, there can be redundant states — states 
from which the future looks exactly the same and also 
states which have the same exact histories mapping to 
them. When we move to weakly asymptotically synchro- 
nizable presentations, all redundant states are removed 
and the remaining states must induce a partition on the 
set of histories that is a refinement of the causal state 
partition; cf. Ref. [2^ Lemma 7]. Finally, minimal unifi- 
lar presentations are the e-machines, whose partition of 
the pasts is the coarsest one possible. 

In this light, one might conclude that ε-machines are an overly restricted set of presentations. They are indeed a restricted set, but it is a restriction with purpose: The ε-machine is the unique minimal prescient presentation within the set of a process's presentations. Moreover, all of a process's properties can be determined from its ε-machine. These facts allow one to purposefully conflate properties of the ε-machine with the process's properties.

[Figure 2: nested regions of presentation classes; the outermost region is labeled "Nonunifilar".]

FIG. 2. The hierarchy of presentations of a finitary process. The gray region represents that portion to which the ε-machine belongs.

We will use an information diagram (I-diagram) [3 to analyze what happens as one relaxes the defining properties of the ε-machine presentation's random variables. With the ε-machine, we have the past $\overleftarrow{X}$, the causal states $\mathcal{S}$, and the future $\overrightarrow{X}$. As we move away from the ε-machine's causal states, we must consider in addition the rival states $\mathcal{R}$.

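In total, there are four random variables to consider. The full range of their possible information-theoretic relationships appears in the information diagram (I-diagram) of Fig. 3. However, Appendix C shows that 7 of the 15 atoms (elemental components of the multivariate information measure sigma algebra) vanish. This allows us to simplify other atoms dramatically. For example, the atom:

$$ \begin{aligned}
& I[\overleftarrow{X}; \mathcal{S}; \mathcal{R}; \overrightarrow{X}] && (79) \\
&= I[\overleftarrow{X}; \mathcal{S}; \overrightarrow{X}] - I[\overleftarrow{X}; \mathcal{S}; \overrightarrow{X} | \mathcal{R}] && (80) \\
&= I[\overleftarrow{X}; \mathcal{S}; \overrightarrow{X}] && (81) \\
&= I[\overleftarrow{X}; \overrightarrow{X}] - I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}] && (82) \\
&= I[\overleftarrow{X}; \overrightarrow{X}] - \left( I[\overleftarrow{X}; \mathcal{R}; \overrightarrow{X} | \mathcal{S}] + I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}, \mathcal{R}] \right) && (83) \\
&= I[\overleftarrow{X}; \overrightarrow{X}] , && (84)
\end{aligned} $$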

where we made use of the atoms that vanish. Thus, the four-way mutual information simply reduces to the mutual information between the past and the future, the excess entropy:

$$ I[\overleftarrow{X}; \mathcal{S}; \mathcal{R}; \overrightarrow{X}] = \mathbf{E} . \tag{85} $$

FIG. 3. The general four-variable information diagram involving $\overleftarrow{X}$, $\mathcal{S}$, $\mathcal{R}$, and $\overrightarrow{X}$. The shaded light gray is the generalized crypticity $\chi$. The yellow is the excess entropy $\mathbf{E}$. The dark gray is the oracular information $\zeta$. The hatched area is the gauge information $\varphi$. Note that this is only a schematic diagram of the interrelationships. In particular, potentially infinite quantities, such as $H[\overleftarrow{X}]$ and $H[\overrightarrow{X}]$, are depicted with finite areas.



Similar calculations reduce the other information measures in Fig. 3 correspondingly. We now consider these reductions in turn.



A. Case: Minimal Unifilar Presentation 

The set of minimal unifilar presentations corresponds exactly to the ε-machines, up to state relabeling. The states in these presentations, the causal states, induce a partition of the infinite pasts via the function $\epsilon(\overleftarrow{x})$.

The information diagram and entropy growth plot are particularly simple, as seen in Fig. 4 and Fig. 5. This simplicity derives from the efficient predictive role the causal states play. Referring to the I-diagram, $H[\mathcal{S} | \overleftarrow{X}] = 0$ because of determinism of the $\epsilon(\cdot)$ map, Eq. (fTsl). Next, causal states, as well as all other states we consider, are prescient states and so $I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}] = 0$. These straightforward requirements entirely determine the form of the ε-machine I-diagram in Fig. 4. As we step through the space of presentation classes, we will see these relationships become more complex.

There are three quantities that require attention in this figure. First, the state entropy $H[\mathcal{R}]$ is equal to $C_\mu$, the statistical complexity. This particular state information is considered privileged as it is the state information associated with the ε-machine and so the process. The excess entropy $\mathbf{E}$ is the mutual information between the past and future and is also exactly that information which the (causal) states contain about the future. Lastly, the crypticity $\chi$ is the amount of information "overhead" required for prediction using the ε-machine. Generally, this overhead is associated with the presentation as well as the process itself, due to the uniqueness of the ε-machine presentation. It is the irreducible memory associated with the process. At any time, the process itself or a predictive model must keep track of $C_\mu$ bits of state information, while only $\mathbf{E}$ bits of this information are correlated with the future.

FIG. 4. The information diagram for an ε-machine. The states of the presentation are causal states and induce a partition on the past. The entropy over the states, $H[\mathcal{R}_0] = H[\mathcal{S}_0]$, defines the statistical complexity ($C_\mu$). The process crypticity is the difference of the statistical complexity and the excess entropy.

FIG. 5. Entropy growth for a generic ε-machine, plotted against L [symbols]. $H[X_0^L]$ and $H[X_0^L, \mathcal{S}_L]$ both converge to the same asymptote. $H[\mathcal{S}_0, X_0^L]$ is linear.

FIG. 6. The ε-machine presentation of the Golden Mean Process.

The entropy growth plot, Fig. 5, is also simplified by
using causal states. In terms of our newly defined
integrals: $\mathcal{K}_n = 0$ for all $n$ and $\mathcal{J}_1 = H[\mathcal{S}] - \mathcal{I}_1 = \chi$.

A simple example that illustrates all of these points is
provided by the Golden Mean Process and its ε-machine;
see Fig. 6. When the probability $p$ is chosen to be 1/2, the
values of our information measures are $h_\mu = 2/3$ bits
per symbol, $C_\mu = \log_2(3) - 2/3 = 0.9183$ bits, $\chi = 2/3$ bits,
and $\mathbf{E} = C_\mu - \chi = 0.2516$
bits. As we explore alternate presentations, we will
return to this process as a common thread for explanation
and intuition.
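
The Golden Mean values quoted here can be reproduced numerically. The following is a minimal Python sketch, assuming the two-state ε-machine in which state A (histories ending in 1) emits 1 with probability p and stays in A, or emits 0 and moves to B, while state B (histories ending in 0) must emit 1 and return to A. The function names and the power-iteration solver are illustrative choices, not part of the original development. The excess entropy is obtained from the shortcut $\mathbf{E} = H[X_0] - h_\mu$, which is valid only because the process has Markov order 1.

```python
from math import log2


def entropy(probs):
    """Shannon entropy, in bits, of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


def stationary(trans, iters=200):
    """Stationary state distribution by power iteration (illustrative solver)."""
    pi = {s: 1.0 / len(trans) for s in trans}
    for _ in range(iters):
        nxt = {s: 0.0 for s in trans}
        for s in trans:
            for _sym, prob, target in trans[s]:
                nxt[target] += pi[s] * prob
        pi = nxt
    return pi


def golden_mean_measures(p=0.5):
    # Assumed ε-machine: each entry is (symbol, probability, next state).
    trans = {
        "A": [("1", p, "A"), ("0", 1.0 - p, "B")],
        "B": [("1", 1.0, "A")],
    }
    pi = stationary(trans)
    # Unifilar presentation: h_mu is the state-averaged emission entropy.
    h_mu = sum(pi[s] * entropy([q for _, q, _ in trans[s]]) for s in trans)
    c_mu = entropy(pi.values())
    # Single-symbol distribution, then E = H[X_0] - h_mu (Markov order 1 only).
    sym = {}
    for s in trans:
        for x, q, _ in trans[s]:
            sym[x] = sym.get(x, 0.0) + pi[s] * q
    excess = entropy(sym.values()) - h_mu
    chi = c_mu - excess
    return h_mu, c_mu, excess, chi


h_mu, c_mu, excess, chi = golden_mean_measures(0.5)
print(f"h_mu={h_mu:.4f} C_mu={c_mu:.4f} E={excess:.4f} chi={chi:.4f}")
# prints: h_mu=0.6667 C_mu=0.9183 E=0.2516 chi=0.6667
```

Power iteration is used only to keep the sketch free of external dependencies; any linear solver for the stationary distribution would serve equally well.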



B. Case: Weakly Asymptotically Synchronizable 
Presentations 

Let's relax the minimality constraint, leaving the
ε-machines for presentations that are nonminimal
unifilar and weakly asymptotically synchronizable. Again,
the states correspond to a partition of the infinite pasts,
but since they are prescient and not minimal unifilar, the
partition must be a refinement of the causal-state
partition [in].

FIG. 7. The information diagram for a presentation that
is weakly asymptotically synchronizable, but not necessarily
minimal unifilar. The states still induce a partition on the
infinite past. The presentation crypticity $\chi(\mathcal{R})$ is the difference
of the state entropy $H[\mathcal{R}] > C_\mu$ and the excess entropy $\mathbf{E}$.

The effect of this is benign as seen in both the
I-diagram (Fig. 7) and the entropy growth plot (Fig. 8). In
Fig. 7, weak asymptotic synchronizability ensures
that $H[\mathcal{R}|\overleftarrow{X}] = 0$. Demanding prescient states
determines the form of the I-diagram. Figure 7 indicates that
$H[\mathcal{R}] > H[\mathcal{S}]$. This is a consequence of $\mathcal{R}$ being a
nontrivial refinement of $\mathcal{S}$.

Examining the entropy growth plot, the increased state
information is reflected in the values of the block-state
and state-block entropy curves at $L = 0$. Additionally,
it is interesting to note what happens to the cryptic
order. We generalized the definition of cryptic order to
be that length where the block-state entropy reaches its
asymptote. Since block-state entropy is nondecreasing,
this suggests that it might be forced to reach its
asymptote at a larger value of $L$ than the cryptic order for
the ε-machine presentation. We can see that this is in
fact true by expanding the following joint entropy in two
ways. Note that we combine variables from two different
presentations and expand $H[X_0^L, \mathcal{S}_L, \mathcal{R}_L]$:



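$H[X_0^L, \mathcal{R}_L] + H[\mathcal{S}_L | X_0^L, \mathcal{R}_L] = H[X_0^L, \mathcal{S}_L] + H[\mathcal{R}_L | X_0^L, \mathcal{S}_L]$

$H[X_0^L, \mathcal{R}_L] = H[X_0^L, \mathcal{S}_L] + H[\mathcal{R}_L | X_0^L, \mathcal{S}_L]$

$\geq H[X_0^L, \mathcal{S}_L]$ . (86)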



In the above, we make use of the fact that $\mathcal{R}$ is a
refinement of $\mathcal{S}$, so that $H[\mathcal{S}_L | X_0^L, \mathcal{R}_L] = 0$, and that
conditional entropies are nonnegative. This shows that the
block-state curve for the nonminimal presentation lies above
or on the curve for the ε-machine presentation. Since block and
block-state entropies share an asymptote, $\mathbf{E} + L h_\mu$, the
nonminimal unifilar block-state entropy will reach its
asymptote at a value greater than or equal to the process's
cryptic order. More care will be required in the subsequent
cases, as the relations among entropy growth functions
are more complicated.

FIG. 8. Entropy growth for a weakly asymptotically
synchronizing presentation. $H[X_0^L]$ and $H[X_0^L, \mathcal{S}_L]$ both converge to
the same asymptote. $H[\mathcal{S}_0, X_0^L]$ is linear. $H[\mathcal{R}]$ is larger than
$C_\mu$.

To illustrate these class characteristics, consider the
following three-state presentation of the Golden Mean
Process in Fig. 9. The original causal state partition,
{A = *1, B = *0}, has become refined. (Here, * denotes
any allowed history.) We now have {A = *11, B = *0,
C = *01}. It is straightforward to verify that $H[\mathcal{R}] =
\log_2(3) = 1.585$ bits. Excess entropy is unchanged as
it is a feature of the process language and not the
presentation. As illustrated in Fig. 7, the crypticity grows
commensurately with $H[\mathcal{R}]$.

We have shown that for weakly asymptotically
synchronizable presentations the presentation cryptic order
is greater than or equal to the cryptic order. It is
interesting to note that it is also possible for the
presentation cryptic order to surpass even the Markov order.
Our three-state example (Fig. 9) is 2-cryptic while the
Markov order remains $R = 1$ as it also depends only on
the process language.

Since the Markov order $R$ bounds the domain of the
$\mathcal{I}_0$ integral and the presentation cryptic order $k$ bounds
the domain of the $\mathcal{J}_0$ integral, the domain of the
synchronization information is bounded by $\max\{R, k\}$.
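
The comparison between the two presentations can be checked by direct enumeration. The sketch below is a hedged illustration, assuming p = 1/2, the transition structures implied by the history partitions {A = *1, B = *0} and {A = *11, B = *0, C = *01}, and stationary distributions entered by hand; the helper names `block_state_entropy` and `cryptic_order` are hypothetical. It evaluates the block-state entropy $H[X_0^L, \mathcal{R}_L]$ for each presentation and reports the smallest $L$ at which it meets the asymptote $\mathbf{E} + L h_\mu$.

```python
from collections import defaultdict
from math import log2


def entropy(probs):
    """Shannon entropy, in bits, of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


def block_state_entropy(trans, pi, L):
    """H[X_0^L, R_L]: joint entropy of an L-block and the state reached after it."""
    joint = {("", s): pr for s, pr in pi.items()}
    for _ in range(L):
        nxt = defaultdict(float)
        for (word, state), pr in joint.items():
            for sym, q, target in trans[state]:
                nxt[(word + sym, target)] += pr * q
        joint = nxt
    return entropy(joint.values())


# Assumed presentations for p = 1/2; entries are (symbol, probability, next state).
EPS = {"A": [("1", 0.5, "A"), ("0", 0.5, "B")], "B": [("1", 1.0, "A")]}
EPS_PI = {"A": 2 / 3, "B": 1 / 3}
THREE = {
    "A": [("1", 0.5, "A"), ("0", 0.5, "B")],  # histories *11
    "B": [("1", 1.0, "C")],                   # histories *0
    "C": [("1", 0.5, "A"), ("0", 0.5, "B")],  # histories *01
}
THREE_PI = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

H_MU = 2 / 3              # entropy rate of the process
E = log2(3) - 4 / 3       # excess entropy of the process


def cryptic_order(trans, pi, max_L=10, tol=1e-9):
    """Smallest L at which H[X_0^L, R_L] meets the asymptote E + L * h_mu."""
    for L in range(max_L + 1):
        if abs(block_state_entropy(trans, pi, L) - (E + L * H_MU)) < tol:
            return L
    return None


for L in range(4):
    print(L,
          round(block_state_entropy(EPS, EPS_PI, L), 4),
          round(block_state_entropy(THREE, THREE_PI, L), 4))
print(cryptic_order(EPS, EPS_PI), cryptic_order(THREE, THREE_PI))  # prints: 1 2
```

Under these assumptions the three-state curve lies on or above the ε-machine curve at every $L$, and the two curves first meet the shared asymptote at $L = 1$ and $L = 2$, respectively.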



C. Case: Unifilar Presentations 

Removing the requirement that a presentation be
weakly asymptotically synchronizable, we no longer
operate with (recurrent) states that correspond to a partition
of the infinite past, but rather to a covering of the set
of infinite pasts. That is, $\eta(\overleftarrow{x})$ can be multivalued,
although for each $\rho \in \mathcal{R}$, $\eta^{-1}(\rho)$ is a set of pasts that is a
subset of some causal state's set of pasts.

FIG. 9. A weakly asymptotically synchronizable and
nonminimal unifilar presentation of the Golden Mean Process:
observing a 0 synchronizes the observer to state B.

Every allowable infinite history induces at least one 
state in the presentation — this is the definition of an al- 
lowable infinite history. Additionally, any presentation 
that is not weakly asymptotically synchronizable must 
have a (positive measure) set of histories where each his- 
tory induces more than one state. 

Consider a unifilar presentation and an infinite history 
which induces only one state. Due to unifilarity, we can 
use this history to construct an infinite set of histories 
that are also synchronizing. We conjecture that this set 
of histories must have zero measure and, even stronger, 
that for finite-state unifilar presentations with a single 
recurrent component, there are no synchronizing histo- 
ries. 

This inability to synchronize, a product of the
nontrivial covering, is represented as the information measure $\varphi$
in Fig. 10. This information is not captured by the causal
states. In fact, it is not even captured by the past (or
the future). It also is not necessary for making
predictions with the same power as the ε-machine. Like $\chi(\mathcal{R})$,
$\varphi$ is unnecessary for prediction. However, unlike $\chi(\mathcal{R})$,
$\varphi$ does not capture any structural property of the
process. Instead, it represents degrees of freedom entirely
decoupled (informationally) from the process language
and prediction. For this reason, we called it the gauge
information.

The entropy growth plot of Fig. 11 has a new and
significant feature representing the change in class. The
asymptotes of the block entropy and block-state entropy
become nondegenerate. This has the effect of making
the synchronization information diverge. Although this
fact follows immediately from the definition of weakly
asymptotically synchronizable, it is instructive to see its
geometric representation.

FIG. 10. The information diagram for a presentation that is
not weakly asymptotically synchronizable, but still unifilar.
The states are prescient, but no longer induce a partition on
the infinite past. Furthermore, the states contain information
that the past does not contain. The presentation crypticity is
the difference of the state entropy $H[\mathcal{R}_0] > C_\mu$ and the excess
entropy $\mathbf{E} = I[\overleftarrow{X}_0; \overrightarrow{X}_0]$.

FIG. 11. Entropy growth for a not weakly asymptotically
synchronizable, but unifilar presentation. $H[X_0^L]$ and $H[X_0^L, \mathcal{S}_L]$
converge to different asymptotes. $H[\mathcal{S}_0, X_0^L]$ is linear.
$H[\mathcal{R}]$ is larger than $C_\mu$.

Since, from this point forward, synchronization
information is always infinite, we find it necessary to
re-express what synchronization information means. It
can be denoted, recall Eq. (73), as the sum of a
finite piece and the limit of a linear (in $L$) piece: $\mathbf{S} =
\mathcal{I}_0 + \mathcal{J}_0 + \lim_{L \to \infty} L \varphi$. The rate of increase of the linear
piece is exactly the gauge information.

It is also interesting to note that when this
information is obtained (that is, a constraint is imposed upon
the descriptional degrees of freedom) unifilarity
maintains synchronization as more data is produced. In this
sense, acquiring gauge information is a "one-time" cost.

FIG. 12. A unifilar, but not weakly asymptotically
synchronizing, presentation of the Golden Mean Process.

The Golden Mean Process presentation in Fig. 12
illustrates all of the features described above. It is
straightforward to see that this presentation is not weakly
asymptotically synchronizing. Any history, finite or infinite,
has exactly two states that it induces. This degeneracy
is never broken, due to unifilarity. Rephrasing, the
gauge information value, $\varphi = 1$ bit, derives from the fact
that each infinite history induces one of two states with
equal likelihood. This relies on the fact that there is no
oracular information contribution ($\zeta = 0$ bits since the
presentation is unifilar) to disentangle from the gauge
information.
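
The persistence of a two-fold state degeneracy can be illustrated with a short Python sketch. The four-state machine below is a hypothetical construction, not necessarily the presentation of Fig. 12: it interleaves two copies of the Golden Mean ε-machine (p = 1/2) so that emitting a 1 from an A state switches copies, which keeps the presentation unifilar with a single recurrent component. The state labels, the hand-entered stationary distribution, and the function name `state_uncertainty` are all assumptions made for illustration.

```python
from collections import defaultdict
from math import log2


def entropy(probs):
    """Shannon entropy, in bits, of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical unifilar presentation; entries are (symbol, probability, next state).
TRANS = {
    "A1": [("1", 0.5, "A2"), ("0", 0.5, "B1")],
    "A2": [("1", 0.5, "A1"), ("0", 0.5, "B2")],
    "B1": [("1", 1.0, "A1")],
    "B2": [("1", 1.0, "A2")],
}
# Stationary distribution, entered by hand using the copy-swap symmetry.
PI = {"A1": 1 / 3, "A2": 1 / 3, "B1": 1 / 6, "B2": 1 / 6}


def state_uncertainty(L):
    """H[R_L | X_0^L] = H[X_0^L, R_L] - H[X_0^L] by exhaustive enumeration."""
    joint = {("", s): pr for s, pr in PI.items()}
    for _ in range(L):
        nxt = defaultdict(float)
        for (word, state), pr in joint.items():
            for sym, q, target in TRANS[state]:
                nxt[(word + sym, target)] += pr * q
        joint = nxt
    words = defaultdict(float)
    for (word, _state), pr in joint.items():
        words[word] += pr
    return entropy(joint.values()) - entropy(words.values())


for L in range(6):
    print(L, round(state_uncertainty(L), 4))
```

For this assumed machine the state entropy is $\log_2(3) + 1/3$ bits at $L = 0$, and the residual state uncertainty stays at exactly 1 bit for every $L \geq 1$: observing more symbols never resolves which copy the presentation occupies, consistent with a gauge information of 1 bit.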



D. Case: Nonunifilar Presentations 

Finally, we remove the requirement of unifilarity 
and examine the much larger, complementary space of 
nonunifilar presentations. Only one nonunifilar state 
must be present to change the class of the whole presenta- 
tion. This ease of breaking unifilarity is why nonunifilar 
presentations form a much larger class. 

Examining the I-diagram in Fig. 13, we notice one
new feature: the oracular information $\zeta = I[\mathcal{R}; \overrightarrow{X} | \overleftarrow{X}] =
I[\mathcal{R}; \overrightarrow{X} | \mathcal{S}] \neq 0$. The oracular information is a curious
quantity and so deserves careful interpretation. It is
the degree to which the presentation state reduces
uncertainty in the future beyond that for which the past
can account. One might think of this feature as
"super-prescience". Not only is the information from the past
being maximally utilized for prediction, but some
additional information is also injected. We make several
remarks about this.

It is well known that a process's nonunifilar
presentations may be smaller than the corresponding ε-machine.
This fact is sometimes cited [ST] as providing evidence
that the smaller nonunifilar presentation is the more
"natural" one [ID]. While it is true that the state
information $H[\mathcal{R}]$ can be smaller than $C_\mu$, and in fact often is,
the I-diagram makes plain the fact that oracular
information must be introduced to determine $\mathcal{R}$ and, thus, make
a super-prescient prediction. For this reason, unless one
is transparent about allowing for oracular information, it
is not appropriate to make a judgment about naturalness
of nonunifilar presentations.

FIG. 13. The information diagram for a presentation that is
not unifilar. The states are super-prescient, do not induce a
partition on the past, and have information not contained in
the past. The presentation crypticity is the difference of the
state entropy $H[\mathcal{R}]$ and the excess entropy ($\mathbf{E}$). Note, the
state entropy can also be smaller than the statistical
complexity.

Given that we do not have the luxury of access to an
oracle, we might like to know how these presentations
perform without this information. The nonoracular part
of $I[\mathcal{R}; \overrightarrow{X}]$ is simply $\mathbf{E}$. That is, without the
oracular information, we predict just as we would with any
other prescient presentation. However, the predictions
are made using distributions over states rather than
individual states. (The former are the mixed states of Ref.
[23].) More importantly, as we continue to make
predictions, the state distribution evolves through a series
of distributions. These distributions are in 1-to-1
correspondence with the causal states of the ε-machine. And
so, for a nonoracular user of a nonunifilar presentation
to communicate her history-induced state to another
requires the transmission of $C_\mu$ bits. The statistical
complexity is inescapable as the (nonoracular) information
storage of the process.

When discussing unifilar presentations that are not weakly
asymptotically synchronizable, we indicated that the
gauge information was a "one-time" cost. Now, we ask
the same question of the two informations (gauge and
oracular) that are not products of the past. Since we no
longer have unifilarity, state uncertainty is dynamically
reintroduced as synchronization is lost. That is,
nonunifilar presentations are allowed to locally resynchronize
following the introduction of state uncertainty. The net
result is that over time synchronization is repeatedly lost
and reacquired.



The entropy growth plot of Fig. 14 makes one last
adjustment to acknowledge the change in class. For the first
time, the state-block entropy is nonlinear. It approaches
its asymptote from above and, moreover, the asymptote
is independent of the block-state asymptote. The
projection back onto the y-axis mirrors our final and most
general I-diagram of Fig. 13. The right panel emphasizes
that the crypticity $\chi(\mathcal{R})$ can be less than the oracular
information $\zeta$ in general cases.

A nonunifilar presentation of the Golden Mean Process
is shown in Fig. 15. All of the above-mentioned
quantities are nonzero for this presentation: For $p = 1/2$, the
crypticity $\chi(\mathcal{R}) = 1/3$ bits, the gauge information $\varphi = 1$
bit, and the oracular information $\zeta = 1/3$ bits. The value
of the gauge information (1 bit) is easy to understand. It
indicates that the nonunifilar presentation is two copies
of a unifilar presentation of the Golden Mean Process
sutured together. All of history space is covered twice and
the choice of which component of the cover is visited is
a fair coin flip. The crypticity and oracular information
(crypticity's time-reversed analog) are the same, due to
the nonunifilar presentation respecting the time-reverse
symmetry of the Golden Mean Process [23].
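
The claim made earlier in this subsection, that a nonoracular user of a nonunifilar presentation works with state distributions that are in 1-to-1 correspondence with the causal states, can be sketched in Python. The two-state machine below is a hypothetical example and is not the presentation of Fig. 15: it is the time reversal of the Golden Mean ε-machine at p = 1/2, which is nonunifilar because state A emits a 1 and then moves to either A or B. The matrix layout and the function names are assumptions made for illustration; exact rational arithmetic is used so that repeated distributions compare equal.

```python
from fractions import Fraction as F

# Hypothetical nonunifilar presentation, states ordered (A, B).
# T[x][i][j] = Pr(emit x and move to state j | current state i).
T = {
    "1": [[F(1, 2), F(1, 2)], [F(0), F(0)]],  # A emits 1, then goes to A or B
    "0": [[F(0), F(0)], [F(1), F(0)]],        # B emits 0, then goes to A
}


def update(mu, x):
    """Bayesian update of a state distribution after observing symbol x."""
    new = [sum(mu[i] * T[x][i][j] for i in range(2)) for j in range(2)]
    total = sum(new)
    if total == 0:
        return None  # symbol x cannot follow this history
    return tuple(v / total for v in new)


def recurrent_mixed_states(start, depth=8):
    """Distinct state distributions reachable after at least one symbol."""
    frontier, seen = {start}, set()
    for _ in range(depth):
        nxt = set()
        for mu in frontier:
            for x in T:
                m = update(mu, x)
                if m is not None and m not in seen:
                    nxt.add(m)
        seen |= nxt
        frontier = nxt
    return seen


stationary = (F(2, 3), F(1, 3))
for mu in sorted(recurrent_mixed_states(stationary)):
    print(mu)
```

For this assumed machine only two distributions recur: (1/2, 1/2), induced by any history ending in 1, and (1, 0), induced by any history ending in 0. These match the two causal states {*1, *0} of the Golden Mean Process.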



VIII. CONCLUSIONS 

Our development started out discussing synchroniza- 
tion and control. The tools required to address these — 
the block-state and state-block entropies — quickly led to 
a substantially enlarged view of the space of competing 
models, the rival presentations, and a new collection of 
information measures that reflect their subtleties and dif- 
ferences. 

As milestones along the way, we gave example
presentations of the well known Golden Mean Process that
went from the ε-machine to a nonminimal
nonsynchronizing nonunifilar presentation. Table I summarizes the
quantitative results. It gives the entropy rate $h_\mu$,
statistical complexity $C_\mu$, excess entropy $\mathbf{E}$, and the crypticity $\chi$
for the process itself. Immediately following, it compares
the analogous measures for the range of presentations
considered. In addition, the gauge information $\varphi$ and
the oracular information $\zeta$, being properties of
presentations, are added. Careful study of the table shows how
the measures track the presentations' structural changes.

A few comments are in order about the tools the
development required. The first were the block-state and
state-block entropies, as noted. Analyzing their
word-length convergence properties was the backbone of the
approach, one directly paralleling the previously
introduced entropy convergence hierarchy [12]. Another
important tool was the I-diagram. While it is not necessary
in establishing final results, it is immensely helpful in
organizing one's thinking and in managing the
complications of multivariate information measures.
Methodologically speaking, the principal subject was the
four-variable (past, future, causal state, and presentation
state) I-diagram with its sigma algebra of 15 atoms.
Thus, the methodology of the development turned on
just two tools: block entropy convergence and
presentation information measures.

FIG. 14. Entropy growth for a nonunifilar presentation. Left: $H[X_0^L]$ and $H[X_0^L, \mathcal{S}_L]$ converge to different asymptotes;
$H[\mathcal{S}_0, X_0^L]$ is not linear and $H[\mathcal{R}]$ is larger than $C_\mu$. Right: The same as on the left, but illustrating that $\chi$ can be less than $\zeta$.

FIG. 15. A nonunifilar presentation of the Golden Mean
Process.

As for the concrete results, we showed that there are
two mechanisms operating in processes that are hard to
synchronize to, as measured by the synchronization
information, which consists of two corresponding
independent contributions. The first is the transient information,
which reflects entropy-rate overestimates that occur at
small block lengths. The second, new here, reflects the
state information that is not retrodictable using the
future. With these two contributions laid out, the general
connection between synchronization and transient
information, previously introduced in Ref. [12], became clear.
We also pointed out that the synchronization information
for nonsynchronizing presentations can diverge. This, in
turn, called for a generalized definition of synchronization
appropriate to all presentations.

We also generalized the process crypticity, beyond the
domain of ε-machine optimal presentations, to describe
the amount of presentation state information that is
shared with the past but not transmitted to the future.
As a sibling of the crypticity, we introduced a new
information measure for generic presentations, the oracular
information, that is the amount of state information
shared with the future, but not derivable from the past.

Finally, to account for "components", either explicitly
or implicitly included in a presentation, that are not
justified by the process statistics, we introduced the gauge
information, intentionally drawing a parallel to the
concept of gauge degrees-of-freedom familiar from physics.

One immediate result was that the information
measures allowed us to delineate the hierarchy of a process's
presentations. The hierarchy goes from the unique,
minimal unifilar, optimal predictor (ε-machine) to
nonminimal unifilar, weakly asymptotically synchronizing
presentations to nonsynchronizing, unifilar presentations. We
showed these are nested classes. Stepping outside to the
nonunifilar presentations leaves one in a markedly larger
class for which all of the information measures play a
necessary role.

We trust that the presentation hierarchy makes the
singular role of the ε-machine transparent. First, the
ε-machine's minimality and uniqueness are those of the
corresponding process. This cannot be said for
alternative presentations. Second, there is a wide range of
properties that can be efficiently calculated from it, whereas
alternative presentations may preclude this. One cannot calculate
a process's stored information ($C_\mu$) or information
production rate ($h_\mu$) from, for example, nonunifilar
presentations. The latter must be converted, either directly or
indirectly, to the process's ε-machine to calculate them.



Information Measures for Alternative Presentations

Process          $h_\mu$    $C_\mu$              $\mathbf{E}$         $\chi$
Golden Mean      2/3        $\log_2(3) - 2/3$    $\log_2(3) - 4/3$    2/3

Presentation     $H[X|\mathcal{R}]$    $H[\mathcal{R}]$       $I[\mathcal{R}; \overrightarrow{X}]$    $\chi(\mathcal{R})$    $\varphi$    $\zeta$
ε-Machine        $h_\mu$    $C_\mu$              $\mathbf{E}$         $\chi$    0    0
Synchronizable   $h_\mu$    $\log_2(3)$          $\mathbf{E}$         4/3       0    0
Unifilar         $h_\mu$    $\log_2(3) + 1/3$    $\mathbf{E}$         5/3       1    0
Nonunifilar      1/3        $\log_2(3) + 1/3$    $\log_2(3) - 1$      1/3       1    1/3

TABLE I. Comparison of information measures for presentations of the Golden Mean Process with transition parameter p = 1/2.



Nonetheless, as discussed at some length in Ref. [57],
in varying circumstances (limited material, inference, or
compute-time resources; ready access to sources of ideal
randomness; noisy implementation substrates; and the
like) the ε-machine may not be how an observer should
model a process. A minimal nonunifilar presentation,
that is necessarily more stochastic internally than the
ε-machine [29], may be preferred due to it having a
smaller set of states.

Recalling the duality of synchronization and control, 
we close by noting that essentially all of the results here 
apply to the setting in which an agent attempts to steer 
a process into desired states. The efficiency with which 
the control signals achieve this is reflected in the ana- 
logue of block entropy convergence. The very possibility 
of control has its counterparts in an implementation hi- 
erarchy that mirrors the presentation hierarchy, but with 
controllability instead of synchronizability. 



Appendix A: Notation Change for Total
Predictability

The definition for $\mathcal{I}_n$ in Eq. (14), the total
predictability, represents a minor change in notation
from Ref. [12]. (We refer to the latter as RURO,
abbreviating its title.) There, the minimum $L$ was usually
$n$ except for $n = 2$, when the minimum $L$ value was
$L = 1$ instead. One reason for the change in definition is
that $\mathcal{I}_2$ now does not depend on any assumption (prior)
for symbol entropy rate and depends only on asymptotic
properties of the process.

To make this explicit, note that the original definition
of total predictability contained a boundary term:

$G_{RURO} = \Delta^2 H(1) + \sum_{L=2}^{\infty} \Delta^2 H(L)$ , (A1)

where

$\Delta^2 H(1) = h_\mu(1) - h_\mu(0) = H(1) - \log_2 |\mathcal{A}|$ . (A2)

The logarithm term characterized the entropy rate
estimate before any probabilities are considered. In the
modified definition of total predictability, we drop the
boundary term, giving:

$G = \mathcal{I}_2 = \sum_{L=2}^{\infty} \Delta^2 H(L)$ . (A3)

The two quantities are related by:

$G_{RURO} = G + \Delta^2 H(1)$ (A4)

$= G + H(1) - \log_2 |\mathcal{A}|$ . (A5)

This affects relationships involving $G$. Previously, for
example,

$G_{RURO} = -R$ , (A6)

where $R$ is the total redundancy. Now,

$G = -R - \Delta^2 H(1)$ (A7)

$= \log_2 |\mathcal{A}| - H(1) - R$ . (A8)



Appendix B: State-Block Entropy Rate Estimate

In this section, we prove Thm. 1, which states that
$H[X_L | \mathcal{R}_0, X_0^L]$ converges monotonically (nondecreasing)
to the entropy rate.



Proof. First, we show that the differences in the state-block
entropy $H[\mathcal{R}_0, X_0^L]$ form a nondecreasing sequence:

$H[X_{L-1} | \mathcal{R}_0, X_0^{L-1}]$ (B1)

$= H[X_L | \mathcal{R}_1, X_1^{L-1}]$ (B2)

$= H[X_L | \mathcal{R}_0, X_0, \mathcal{R}_1, X_1^{L-1}]$ (B3)

$\leq H[X_L | \mathcal{R}_0, X_0, X_1^{L-1}]$ (B4)

$= H[X_L | \mathcal{R}_0, X_0^L]$ . (B5)

Next, we show this sequence is bounded and, thus, has a
limit. For all $k > 0$, we have:

$H[X_L | \mathcal{R}_0, X_0^L]$ (B6)

$= H[X_L | X_{-k}^{k}, \mathcal{R}_0, X_0^L]$ (B7)

$\leq H[X_L | X_{-k}^{L+k}]$ (B8)

$= H[X_{L+k} | X_0^{L+k}]$ . (B9)

Since this holds for all $k$, it also holds in the limit as $k$
tends to infinity, which is the definition of the entropy
rate. Thus, $H[X_L | \mathcal{R}_0, X_0^L]$ is a nondecreasing sequence
and bounded above by $h_\mu$.

Finally, we show that this bounded sequence converges
to $h_\mu$. To do this, we will show that the difference

$H[X_L | X_0^L] - H[X_L | \mathcal{R}_0, X_0^L] = I[\mathcal{R}_0; X_L | X_0^L]$

converges to zero. Then, since the first term (differences
in the block entropies) is known to converge to the
entropy rate, the claim will be proved. We have:

$H[\mathcal{R}_0] \geq \lim_{L \to \infty} I[\mathcal{R}_0; X_0^L]$ (B10)

$= \lim_{L \to \infty} \sum_{k=0}^{L-1} I[\mathcal{R}_0; X_k | X_0^k]$ . (B11)

Since the sum is finite, the terms must tend to zero. □
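
The monotone convergence established in this proof can be observed numerically. The sketch below is an illustration under stated assumptions: it uses a hypothetical two-state nonunifilar presentation of the Golden Mean Process at p = 1/2 (the time reversal of the ε-machine, not a machine taken from the figures), a hand-entered stationary distribution, and hypothetical function names. It estimates $H[X_L | \mathcal{R}_0, X_0^L]$ as successive differences of the state-block entropy $H[\mathcal{R}_0, X_0^L]$.

```python
from collections import defaultdict
from math import log2


def entropy(probs):
    """Shannon entropy, in bits, of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical nonunifilar presentation; entries are (symbol, probability, next state).
TRANS = {
    "A": [("1", 0.5, "A"), ("1", 0.5, "B")],  # same symbol, two possible targets
    "B": [("0", 1.0, "A")],
}
PI = {"A": 2 / 3, "B": 1 / 3}  # stationary distribution, entered by hand


def state_block_entropy(L):
    """H[R_0, X_0^L]: joint entropy of the initial state and the next L symbols."""
    # Track (initial state, word so far, current state) so hidden paths can be summed.
    paths = {(s, "", s): pr for s, pr in PI.items()}
    for _ in range(L):
        nxt = defaultdict(float)
        for (start, word, cur), pr in paths.items():
            for sym, q, target in TRANS[cur]:
                nxt[(start, word + sym, target)] += pr * q
        paths = nxt
    joint = defaultdict(float)
    for (start, word, _cur), pr in paths.items():
        joint[(start, word)] += pr
    return entropy(joint.values())


def rate_estimates(n):
    """H[X_L | R_0, X_0^L] for L = 0..n-1, as successive differences."""
    H = [state_block_entropy(L) for L in range(n + 1)]
    return [H[L + 1] - H[L] for L in range(n)]


print([round(h, 4) for h in rate_estimates(6)])
```

For this assumed machine the first estimate is 0 bits, because each state emits a single symbol with certainty, and every later estimate equals the entropy rate of 2/3 bits per symbol, so the sequence is nondecreasing and bounded above by $h_\mu$ as the theorem requires.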

Appendix C: Reducing the Presentation I-Diagram 

Proving that the various multivariate information
measures vanish makes use of a few facts about states:

• $H[\mathcal{S} | \overleftarrow{X}] = 0$.

• $I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}] = 0$.

• $I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}] = H[\overrightarrow{X} | \mathcal{R}] - H[\overrightarrow{X} | \mathcal{R}, \overleftarrow{X}] = 0$.

The last one follows from limiting ourselves to states that
actually generate the process. Thus, additional
conditioning on the past cannot influence the future, as the
current state alone determines the future.
The following atoms vanish:



• H[S\X,n,^]: 

H\S\X,'R,'f] <H[S\X] ^{) . 

• /[5;7^|^,^]: 

I[S] 7^|^, ~t] = H[S\X, ~t] - H[S\X, TZ, ~t] 

< HiSl'X] 
= . 

• /[5;7^;^|^]: 

I[S;n;^fX] =/[5;7^|^]-/[5;7^|^,^] 
= /[5;7^|^]-0 
= HiSl'X] - H[S\TZ,'X] 
= - H[S\n,% . 

Finally, note that 

\H[S\n,^]\<\H[S\^]\ 
= . 

• /[5;^|^,7^]: 

I[S;^\^,n] ==iJ[5|^,7^]-iJ[5|^,5",7^] 

= H[s\^ ,n] - 

< H[S\1{] 
= . 

• I[X;^\S,TZ]: 

I[^;^\S,n] =iJ[^|5,7^]-iJ[^|5,7^,^] 

= H(jl\s,n] - H[Jl\s,n] 

= . 

• l{X;n;'^\S]: 

I[X; TZ;^\S] = I[X;^\S] - I[X; ~^\S, 7^] 
= . 

• l{X;S;^\n]: 

= . 

The first four vanish due to the causal states being
a function of the past. The last three vanish since any
presentation that generates the process captures all the
information shared between past and future.